Comparative whole-genome analyses of selection marker–free rice-based cholera toxin B-subunit vaccine lines and wild-type lines

We have developed a rice-based oral cholera vaccine named MucoRice-CTB (Cholera Toxin B-subunit) by using an Agrobacterium tumefaciens–mediated co-transformation system. To assess the genome-wide effects of this system on the rice genome, we compared the genomes of three selection marker–free MucoRice-CTB lines with those of two wild-type rice lines (Oryza sativa L. cv. Nipponbare). Mutation profiles of the transgenic and wild-type genomes were examined by next-generation sequencing (NGS). Using paired-end short-read sequencing, a total of more than 300 million reads for each line were obtained and mapped onto the rice reference genome. The number and distribution of variants were similar in all five lines: the numbers of line-specific variants ranged from 524 to 842 and corresponding mutation rates ranged from 1.41 × 10−6 per site to 2.28 × 10−6 per site. The frequency of guanine-to-thymine and cytosine-to-adenine transversions was higher in MucoRice-CTB lines than in WT lines. The transition-to-transversion ratio was 1.12 in MucoRice-CTB lines and 1.65 in WT lines. Analysis of variant-sharing profiles showed that the variants common to all five lines were the most abundant, and the numbers of line-specific variant for all lines were similar. The numbers of non-synonymous amino acid substitutions in MucoRice-CTB lines (15 to 21) were slightly higher than those in WT lines (7 or 8), whereas the numbers of frame shifts were similar in all five lines. We conclude that MucoRice-CTB and WT are almost identical at the genomic level and that genome-wide effects caused by the Agrobacterium-mediated transformation system for marker-free MucoRice-CTB lines were slight. The comparative whole-genome analyses between MucoRice-CTB and WT lines using NGS provides a reliable estimate of genome-wide differences. A similar approach may be applicable to other transgenic rice plants generated by using this Agrobacterium-mediated transformation system.


Background
Production of pharmaceutical ingredients by using plant expression systems (plant-made pharmaceuticals, or PMPs) has become a promising technology [1,2]. The advantages of producing PMPs compared to conventional production systems (such as large-scale bacterial fermentation) are as follows: cost-effectiveness, adaptability for scaling up and possibility to produce eukaryotic proteins with correct 3-dimensional structures [1,3]. Plantbased systems such as transient expression or transgenic systems developed during the last two decades have been reviewed by Paul and Ma [4]. In both systems, higher protein yield has been achieved through improvement of the expression vectors [4,5]. Despite intensive efforts aimed at PMP marketing, only glucocerebrosidase produced in plant cell culture for treatment of Gaucher's disease has been approved for human use [6].
We previously reported MucoRice-CTB, transgenic rice expressing cholera toxin B-subunit (CTB) designed as an oral vaccine against cholera [7]. MucoRice provides a suitable vehicle for expression, accumulation, and mucosal delivery of antigens that are not only stable at room temperature for several years without loss of immunogenicity, but are also protected from digestive enzymes in the gastrointestinal tract. Oral vaccination of mice and macaques with MucoRice-CTB resulted in the induction of antigen-specific serum IgG and mucosal IgA responses with toxin-neutralizing immunity [8]. Because of sequence similarity between cholera toxin (CT) and heat-labile enterotoxin from enterotoxigenic Escherichia coli, MucoRice-CTB successfully induced protective immunity against both Vibrio cholerae-induced and enterotoxigenic E. coli-induced diarrhea [9]. We also achieved high-yield CTB production in rice seeds by using a CTB overexpression system together with an RNA interference (RNAi) cassette to suppress the production of major endogenous storage proteins, prolamin 13 kDa and glutelin A. The amount of CTB produced in rice endosperm without RNAi reached only 1/6 of that of MucoRice-CTB with RNAi [10].
To perform a phase I study as the first step towards human application of MucoRice-CTB, we have recently established a selection marker-free line (51A) as a seed bank by using co-transformation with two different Agrobacterium tumefaciens strains, each carrying a distinct T-DNA containing either a selection marker cassette or the CTB and RNAi cassettes [11]. This Agrobacteriummediated transformation system includes several steps: (1) sterilizing Nipponbare seeds with sodium hypochlorite solution, (2) induction of calli with plant hormones, (3) transformation with Agrobacterium carrying T-DNA, (4) regeneration in the presence of plant hormones followed by cultivation under antibiotic pressure, (5) propagation of the three MucoRice-CTB lines for at least five generations by self-pollination to fix the desired transgene. Since it has been reported that the Agrobacterium-mediated transformation system may cause genomic changes in the host organisms [12,13], it is essential to assess the effects of our Agrobacterium-mediated transformation system on the genome of MucoRice-CTB seed bank intended for human use.
Recently, next-generation sequencing (NGS) has greatly influenced the discovery of genetic markers [14] and facilitated transcriptomic approaches [15] in various organisms. Furthermore, the increasing availability of reference genomes has promoted resequencing in a wider variety of species. Resequencing allows detecting substantial numbers of genomic variations including single-nucleotide polymorphisms (SNPs) and insertions and deletions (InDels) between the target and reference genomes [16]. In addition to revealing the differences between rice subspecies (japonica and indica), resequencing analysis has provided insights into the diversity of domesticated rice [17][18][19].
In this study, using NGS, we investigated the genomic differences between three selection marker-free MucoRice-CTB lines, including the line 51A intended for phase I clinical trial, and two wild-type (WT) rice lines (Oryza sativa L. cv. Nipponbare from two different sources). The three MucoRice-CTB lines were selected by the level of CTB protein production and elimination of the marker gene used for the initial transformant selection [11]. We found that these MucoRice-CTB and WT lines are almost identical at the genomic level. The type of comparative analysis reported here can be used to estimate genome alterations not only in MucoRice-CTB but also in other transgenic rice plants generated by using a similar Agrobacterium-mediated transformation system.

Read alignment to the rice reference genome
Genomic DNAs of five rice lines (three marker-free MucoRice-CTB lines, 50A, 51A, and 55A; two Nipponbare WT lines, WT1 and WT2) were sequenced by NGS. After filtering to exclude reads with low sequence-quality scores, more than 300 million paired-end reads were obtained for each line ( Table 1). The reads from each line were aligned separately to the rice reference genome [20,21]. In addition to the 12 rice chromosomes, nucleotide sequences of the CTB expression construct, hygromycin resistance gene (hygromycin phosphotransferase: HPT) used as a selection marker, and the binary vector used for Agrobacterium-mediated transformation were added to the reference to examine whether these sequences Mapping rate represents the ratio of the number of mapped reads to that of total reads. Covered length represents the number of genome bases covered with at least one read. Coverage rate is the ratio of covered length to the total length of the rice reference genome (373,245,519 bps, IRGSP-1.0, build 5 [21]). Depth was calculated by dividing the total length of all mapped reads (100 bps for each read) by covered length.
are integrated into the genomes. The resulting mapping rates ranged from 97.1 to 98.1% (Table 1). In all results of mapping of MucoRice-CTBs, we confirmed that the HPT gene had been segregated and excised during the passage of generations (Additional file 1: Figure S1). The coverage rate ranged from 99.1 to 99.6%, whereas the depth (the average number of reads covering a genome) ranged from 84.3 to 101.3 (Table 1).

Variant calling and distribution
The total numbers of detected variants (SNPs and InDels) ranged from 19,103 to 20,623 in the examined lines ( Table 2). Variant distribution profiles were calculated in non-overlapping, consecutive 500-kbp windows for each chromosome, and the averages for MucoRice-CTBs and WTs were compared ( Figure 1A). On chromosomes 1, 2, 4, 10, and 12, regions with higher variant density than other regions on the same chromosomes were observed. On chromosomes 1 and 10, these regions located close to the centromeres. Variant distribution showed substantial consistency over most genome regions in MucoRice-CTBs and WTs, since we observed the variant densities varied from 0 (e.g., between 9 and 9.5 Mbp on chromosome 1 in MucoRice-CTBs and WTs) to 648.0 (between 8 and 8.5 Mbp on chromosome 10 in WT lines), whereas the differences for every corresponding 500-kbp windows throughout the genome of MucoRice-CTBs and WTs were at most 16.7 on chromosome 10 ( Figure 1B).

Comparison of the types of nucleotide substitutions of SNPs
Line-specific SNPs were subdivided into transitions (Ts) and transversions (Tv). Nucleotide substitution profiles were obtained for each line ( Figure 2A). G to A and C to T were the most frequent Ts in both MucoRice-CTBs and WTs. We also found that the Tv frequency from G to T and C to A was increased only in MucoRice-CTBs. The Ts/Tv ratio for MucoRice-CTBs and WTs were 1.12 and 1.65, respectively ( Figure 2B). These results suggest that the substitution patterns in MucoRice-CTBs and WTs were similar.

Comparison of variant-sharing profiles among MucoRice-CTBs and WTs
All variants were classified as line-specific (defined as a variant without being shared by other lines), shared by two, three, or four lines, or common to all fivelines ( Figure 3). The most abundant variant in number (10,369) was of common type. Since the positions of all variants relative to the reference genome could be determined, we could define candidate line-specific variants by excluding the variants present in more than one line. Mutation rates throughout the genome were calculated by dividing the number of line-specific variants in each line by the covered genome length and ranged from 1.41 × 10 −6 to 2.28 × 10 −6 and the average number of line-specific variants was 720 (Table 2). Average numbers of linespecific variants for MucoRice-CTBs or WTs were similar for both totals and breakdowns (insertions, deletions, and SNPs).

Classification of variants by potential impact on protein function
Using SNPEff software and publicly available rice data sets [20], we predicted the effects of variants on protein function and categorized all of the line-specific variants into 23 effect types (  Three filters for improving the accuracy of each variant (described in the Methods section) were applied to the total variants (shown as "Total" in the second column). Line-specific variants such as Ins, insertions; Del, deletions and SNPs were selected in accordance with the sharing profile from filtered variants. Mutation rates represent probability of a mutation per nucleotide. type in this category) was slightly higher in 50A than in the other four lines.

Discussion
In this study, we analyzed three MucoRice-CTB lines and two untransformed (WT) lines, one of which was originally used to produce the MucoRice-CTB lines. Our purpose was to assess, using NGS, the genome-wide effects of the Agrobacterium-mediated transformation system used to generate the MucoRice-CTB lines by comparing them to WT lines, and to validate NGS as a useful tool to confirm the inheritance of the transgene over the passage of generations. The high-throughput Illumina HiSeq2000 platform provided a large number of paired-end short reads of superior quality from all samples. The coverage was >99% relative to the rice reference genome ( Table 1). The number of variants (SNPs and InDels) per 500 kbp varied greatly depending on the chromosome and position within the chromosome. Highly condensed repetitive sequences have been found around the centromeres in rice [23] and the biased amplification efficiency during NGS or the increased probability of mapping errors often occur within or around such regions [24]. Our results showed that there were variant-rich regions on chromosomes 1 and 10 in which we detected sharp peaks of variant distribution near the centromeres ( Figure 1A). The distribution of the average numbers of variants along the genome was similar in MucoRice-CTB and WT. Despite considerable variation in the variant densities among chromosomes, the differences between MucoRice-CTBs and WTs within the same region were small (at most 16.7 per 500 kbp; Figure 1B). These results suggest that MucoRice-CTBs and WTs have few differences in terms of genome-wide distribution of the number of variants.
Kawakatsu et al. [25] compared the variants in two mutant lines of cultivar Koshihikari: one line was generated by gamma radiation and ethyl methanesulfonate, and the second line was derived from the first one by Agrobacterium-mediated transformation. The transformation-  specific mutation rate was determined as 5.5 × 10 -7 /site. In our study, MucoRice-CTB-specific variant rates ranged from 1.41 × 10 -6 to 2.28 × 10 -6 , which are 2-4 times that from Kawakatsu's report. This difference may be due to the differences in cultivars or data analysis.
For the set of five lines, we analyzed line-specific variants, variants shared by two, three, and four lines, and those shared by all five lines (common variants) ( Figure 3); the common variants were the most abundant. No considerable difference was observed in the average numbers of line-specific variants (either total, SNPs, insertions or deletions) between WTs and MucoRice-CTBs (Figure 3). The pattern of nucleotide substitutions was similar and biased towards G to A and C to T in both groups (Figure 2A). This Ts has been reported to be caused by UV-radiation and the deamination of methylated C [26]. The Ts/Tv ratios determined in our study are similar to the Ts/Tv ratio for rice regenerated from long-term cell culture [27]. A higher number of G to T and C to A Tv was found in MucoRice-CTBs than in WTs; this might explain the difference in the Ts/Tv ratio between MucoRice-CTBs (1.12) and WTs (1.65) (Figure 2). The difference in Tv frequencies between MucoRice-CTBs and WT can be explained by observations of Cheng et al. [28], who reported that oxidized G (8-hydroxy-G), which is often detected in living cells, may pair with A instead of C, resulting in a subsequent change of G to T. In our study, sterilization with sodium chlorite prior to MucoRice-CTB callus generation may have caused G oxidation (followed by insufficient repair). Because the transformation system we used consists of several steps including seed sterilization, callus induction, co-transformation with Agrobacterium, plant regeneration, and the passage of generations, the specific factor(s) responsible for genomewide variations (other than oxidation by sodium hypochlorite during seed sterilization) remain to be elucidated.
Most line-specific variants (MucoRice-CTBs: 94.2%, WTs: 95.8%) belonged to the MODIFIER category (Table 3). According to the SNPEff manual [22], this category includes non-coding variants or variants affecting non-coding genes, which are unlikely to have marked effects on protein functions. The number of non-synonymous coding variants was slightly higher in all MucoRice-CTB lines than in WT lines and may have resulted from G oxidation mentioned above, leading to amino acid substitutions.
Since all MucoRice-CTB lines were generated from WT1 seed stock, we expected that variants from the WT1 would be inherited by MucoRice-CTBs. When the variants inherited to the progenies, the comparison of WT1 with MucoRice-CTBs should result in no different. However, line-specific variants were still observed in all lines and their numbers were similar (Figure 3). Some calli were chosen from different seed scutella; WT seeds used for generation of MucoRice-CTB lines and for genomic analysis were from different individuals. Therefore, line-specific variants in MucoRice-CTBs and WTs may be mainly due to individual differences within the same cultivar, and Agrobacterium-mediated transformation system may have only a limited effect on the genome.
Recently, seven domesticated and landrace cultivars were resequenced with NGS and compared with the rice reference cultivar Nipponbare. The total numbers of variants in these strains were 168,165 for Omachi, 158,310 for Yamadanishiki, 120,675 for Kameji, 180,402 for Gohyakumangoku, 147,639 for Koshihikari, 109,972 for Norin-8, and 987,045 for Moroberekan [29]. Another study reported 67,000 SNPs detected by NGS in Koshihikari in comparison with the rice reference [17]. In the present study, we used Nipponbare, the same cultivar as in the Rice Genome Project [30]; within each line, we detected~20,000 total variants and on average 720 line-specific variants (Table 2), which presumably resulted from individual differences in each line. Thus, the numbers of variants between different cultivars appear to be much larger than those between individual lines within the same cultivar.

Conclusions
We conclude that MucoRice-CTB and WT are almost identical at the genomic level and that the genome-wide effects in marker-free MucoRice-CTB lines were slight in comparison with the individual difference in WT seed stocks. Some difference in the prevalence of nucleotide substitutions between MucoRice-CTBs and WTs may be caused by the Agrobacterium-mediated transformation system. It is essential to find and to control the affecting factors. An accurate genome-wide assessment technology enabled by further improvements in NGS platform, in terms of both hardware and software, could become a key approach in manufacturing plant-made pharmaceuticals.

MucoRice-CTB and WT lines used
In a previous study, we established six HPT selection marker-free MucoRice-CTB lines by using two different A. tumefaciens strains, each carrying a distinct T-DNA vector for co-transformation [11]. The T-DNA vectors contained either the CTB gene with an RNAi cassette or   Line-specific variants (SNPs and InDels) that may affect protein function were categorized into 23 types. These types were further grouped into HIGH, MODERATE, LOW, and MODIFIER according to potential severity. The assignment criteria were pre-defined in the annotation program (SNPEff).
an HPT selection marker cassette. The two T-DNA vectors were introduced into calli and hygromycin-mediated selection was performed. Segregation of the HPT marker gene from the transformant genomes was achieved by the passage of generations. Marker-free transformants were then propagated for at least five generations obtained by self-pollination to fix the desired transgene. Line 51A of MucoRice-CTB was selected because it had the highest CTB expression as a seed bank for vaccine production for human use; the genomic location and structure of the transgenes were determined in this line [11]. In this study, three out of six selection marker-free MucoRice-CTB lines (50A, 51A, and 55A) and two WT rice lines of the same cultivar (WT1 and WT2) were analyzed by NGS. The WT1 stock was previously used to generate MucoRice-CTB; WT2 was maintained by a commercial seed provider. The removal of the selection marker gene and the presence of the CTB gene in three MucoRice-CTB lines were confirmed by PCR analysis (Additional file 1: Figure S1). Cultivation, including germination, was performed hydroponically in growth chambers (352-PJ, Panasonic, Japan). Approximately three-week-old seedlings were used for genomic DNA extraction.

PCR analysis
Genomic

Mapping reads to the reference genome
Mapping of the 100-bp short reads to the rice reference genome sequence (Os-Nipponbare-Reference-IRGSP-1.0 build 5) [20,21] was performed using Burrows-Wheeler Aligner (BWA ver. 0.5.9) [31]. The mapping function 'aln' of BWA was used to generate intermediate files. These were then used to generate SAM files (which contained mapped read information) by running the 'sampe' function. Both algorithms were used with default parameters. The SAM files, which are normally very large, were converted into binary BAM files by using the 'view' function of SAMtools [32]. The BAM files were then sorted by using the 'sort' function of SAMtools. Duplicate reads in sorted BAM files were removed with Picard tools [33] with the following parameters: REMOVE_DUPLICATES = true, AS = true, SORTING_COLLECTION_SIZE_RATIO = 0.1, and VALIDATION_STRINGENCY = LENIENT. Mapping rate was calculated as the ratio between the numbers of mapped reads and total reads. Coverage rate, which is the ratio between the length of the genomic region covered by at least one read and the length of the reference genome was calculated by identifying all uncovered regions in the genome using the 'genomeCoverageBed' function of the BEDTools package [34] with the option '-bga'.

Detecting SNPs and InDels
SNPs and short InDels between the mapped read data and the reference genome were called with SAMtools by using the mpileup function with '-uf' options and default parameters, and then the data format of 'bcf' was converted into 'vcf' with BCFTools [35]. We then used var-Filter in vcfutils (part of the SAMtools package) to remove variants covered by an excessive number of reads (>10,000). Called variants were annotated on the basis of information on gene structure and function from the Rice Annotation Project by using SNPEff (ver. 3.4) [36]. The potential effect of each variant on gene expression and protein structure or function was examined by SNPEff.

Variant filtration
All variants from the five lines were listed according to their genomic positions; to minimize the number of falsepositives, variant filtration was performed according to three criteria: (1) The phred-scaled score (calculated by mpileup in SAMtools) must be at least 30. This criterion guarantees the probability of false positives of ≤0.001. (2) The position of each variant must be covered by at least four reads in each of the five lines regardless of whether the variant was present at the position. Information on the number of reads covering specific positions was obtained by using the coverageBED function in BEDTools [34]. (3) If a variant is shared by more than one line, the alteration type needs to be the same; for example, if an SNP was detected at a certain position in one line whereas an insertion was detected in the same position in another line, these variants were excluded. This criterion was adopted to create the variant-sharing profile, i.e. a 'shared' variant needs to be of the same type and be present at the same position.

Calculation of mutation rates
Mutation rates were calculated by dividing the total number of each line-specific variants by covered length.