Characterization of the Candida orthopsilosis agglutinin-like sequence (ALS) genes

Agglutinin-like sequence (Als) cell-wall proteins play a key role in adhesion and virulence of Candida species. Compared to the well-characterized Candida albicans ALS genes, little is known about ALS genes in the Candida parapsilosis species complex. Three incomplete ALS genes were identified in the genome sequence for Candida orthopsilosis strain 90-125 (GenBank assembly ASM31587v1): CORT_0C04210 (named CoALS4210), CORT_0C04220 (CoALS4220) and CORT_0B00800 (CoALS800). To complete the gene sequences, new data were derived from strain 90-125 using Illumina (short-read) and Oxford Nanopore (long-read) methods. Long-read sequencing analysis confirmed the presence of 3 ALS genes in C. orthopsilosis 90-125 and resolved the gaps located in repetitive regions of CoALS800 and CoALS4220. In the new genome assembly (GenBank PQBP00000000), the CoALS4210 sequence was slightly longer than in the original assembly. C. orthopsilosis Als proteins encoded features well-known in C. albicans Als proteins such as a secretory signal peptide, N-terminal domain with a peptide-binding cavity, amyloid-forming region, repeated sequences, and a C-terminal site for glycosylphosphatidylinositol anchor addition that, in yeast, suggest localization of the proteins in the cell wall. CoAls4210 and CoAls800 lacked the classic C. albicans Als tandem repeats, instead featuring short, imperfect repeats with consensus motifs such as SSSEPP and GSGN. Quantitative RT-PCR showed differential regulation of CoALS genes by growth stage in six genetically diverse C. orthopsilosis clinical isolates, which also exhibited length variation in the ALS alleles and strain-specific gene expression patterns. Overall, long-read DNA sequencing methodology was instrumental in generating an accurate assembly of CoALS genes, thus revealing their unconventional features and first insights into their allelic variability within C. orthopsilosis clinical isolates.

Introduction
... resulting data provide insight into the ALS gene family in C. orthopsilosis and the basis for functional characterization.

Fungal strains and growth conditions
The C. orthopsilosis type strain ATCC 96139 [16] and the genome sequencing strain 90-125 [13] were included in this study, along with 4 clinical isolates (124, 85, 331, and 488) that were part of a strain collection deposited at the Department of Biology, University of Pisa. C. orthopsilosis strains were maintained as 30% glycerol frozen stocks at -20˚C or -80˚C and cultured on YPD agar plates (per liter: 10 g yeast extract, 20 g peptone, 20 g dextrose, 15 g agar). YPD liquid medium was used for routine growth at 30˚C with shaking.
Genomic DNA preparation
C. orthopsilosis genomic DNA for PCR amplification was extracted after an overnight incubation at 30˚C in YPD medium with shaking. Cells were resuspended in a lysis buffer, broken with glass beads, and the resulting suspension extracted with phenol:chloroform:isoamyl alcohol (25:24:1) as described previously [16]. Following RNase treatment, DNA was precipitated with 2 volumes of isopropanol and 10 μl of 4 M ammonium acetate. The pellet was dried and dissolved in 50 μl of TE (pH 8.0).
C. orthopsilosis genomic DNA for long-read sequencing was extracted from cells that were grown for 16 h at 37˚C in YPD medium with 200 rpm shaking. Cells were treated with zymolyase to form spheroplasts that were lysed with sodium dodecyl sulfate. Spheroplasts were handled by gentle mixing by inversion, which was also used during phenol extractions and isopropyl alcohol precipitation of DNA [17].

Genome sequence data generation and assembly
New genome data were derived from strain 90-125 using Illumina (short-read) and Oxford Nanopore (long-read) methods. MiSeq shotgun genomic libraries were prepared with the Hyper Library construction kit (Kapa Biosystems). The library was quantitated by qPCR and sequenced on one MiSeq flowcell for 151 cycles from each end of the fragment using a MiSeq 300-cycle sequencing kit (version 2). FASTQ files were generated and demultiplexed with the bcl2fastq Conversion Software (Illumina, version 2.17.1.14). MiSeq reads were quality trimmed using Trimmomatic [18] with the parameters "LEADING:30 TRAILING:30" prior to assembly. MiSeq yielded 2,281,330 reads of 150 nt each.
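The end-trimming step can be illustrated with a minimal sketch. It assumes the usual behavior of Trimmomatic's LEADING and TRAILING steps: bases are removed from each end for as long as their Phred quality is below the threshold, and interior bases are left untouched. The function name and the example read are hypothetical and are not part of the published pipeline.

```python
def trim_leading_trailing(seq, quals, leading=30, trailing=30):
    """Drop bases from each end while their Phred quality is below the threshold.

    Illustrative re-implementation of end trimming; interior low-quality
    bases are deliberately kept, as only the read ends are examined.
    """
    start = 0
    end = len(seq)
    # Advance past low-quality bases at the 5' end.
    while start < end and quals[start] < leading:
        start += 1
    # Retreat past low-quality bases at the 3' end.
    while end > start and quals[end - 1] < trailing:
        end -= 1
    return seq[start:end], quals[start:end]


# Hypothetical 8-nt read with poor quality at both ends.
trimmed_seq, trimmed_quals = trim_leading_trailing(
    "ACGTACGT", [12, 25, 35, 38, 20, 36, 31, 8]
)
print(trimmed_seq, trimmed_quals)  # GTACG [35, 38, 20, 36, 31]
```

Note that the interior base with quality 20 is retained, because only the ends of the read are inspected.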
For Oxford Nanopore long-read sequencing, 1 μg of genomic DNA was sheared in a gTube (Covaris, Woburn, MA) for 1 min at 6,000 rpm in a MiniSpin plus microcentrifuge (Eppendorf, Hauppauge, NY). The sheared DNA was converted into a Nanopore library with the Nanopore Sequencing kit (LSK-108) with the Expansion barcoding kit (EXP-NBD103; Oxford Nanopore, UK). The library was sequenced on a SpotONFlowcell MK I (R9.4) for 48 h using a MinION MK 1B sequencer. Base calling and demultiplexing were performed in real time with the Metrichor Agent V2.45.3 using the 1D Base Calling plus Barcoding for FLO-MIN_106D 450 bp workflow. Sixty nucleotides (nt) were removed from both ends of each Oxford Nanopore read. Reads longer than 1,000 nt were used in the final assembly. The Oxford Nanopore (ONP) flow cell yielded 40,744 reads for a total of 364,246,709 bp. The mean and median ONP read lengths were 8,940 and 8,754 bp, respectively, with a minimum of 114 bp and a maximum of 98,108 bp.
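The read clipping and length filter described above can be sketched as follows. The text does not state whether the 1,000-nt cutoff was applied before or after clipping, so this sketch assumes it was applied to the clipped read. The function name and the toy reads are hypothetical.

```python
def trim_and_filter(reads, clip=60, min_len=1000):
    """Clip `clip` nt from both ends of each read and keep reads longer than `min_len`.

    `reads` is an iterable of (name, sequence) pairs. Assumption: the length
    cutoff is applied to the clipped sequence, which the text does not specify.
    """
    kept = []
    for name, seq in reads:
        # Reads too short to clip on both ends become empty and are dropped.
        clipped = seq[clip:len(seq) - clip] if len(seq) > 2 * clip else ""
        if len(clipped) > min_len:
            kept.append((name, clipped))
    return kept


# Toy reads: only the first remains longer than 1,000 nt after clipping.
toy_reads = [("read_a", "A" * 1200), ("read_b", "C" * 1100), ("read_c", "G" * 100)]
kept_reads = trim_and_filter(toy_reads)
print([(name, len(seq)) for name, seq in kept_reads])  # [('read_a', 1080)]

# Consistency check on the reported yield: total bases / number of reads.
print(round(364_246_709 / 40_744))  # 8940
```

The final line confirms that the reported total yield and read count agree with the stated mean read length of 8,940 bp.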
Genome assembly was performed with Canu v1.4 [19] using default parameters and the trimmed ONP FASTQ reads, with the command 'canu -p asm -d orthopsilosis genomeSize=14m useGrid=false -nanopore-raw C_orthopsilosis_trimmed.fastq'. ONP reads were then aligned against the assembly using bwa [20], and the alignment was used to polish the assembly with nanopolish v0.6.0 [21]. Finally, the trimmed MiSeq data were used to further polish the assembly with Pilon v1.21 [22].

Analysis of ALS allelic variation
PCR was used to amplify various CoALS fragments to detect allelic size variation. Primers were designed according to the genomic sequence of the strain 90-125 available in the Candida Gene Order Browser database (CGOB3, http://cgob3.ucd.ie; [27,28]; Table 1). PCRs used DreamTaq DNA Polymerase (Thermo Fisher Scientific); primers were synthesized by Sigma Genosys or Integrated DNA Technologies. Amplification of entire CoALS genes used Q5 High-Fidelity DNA polymerase (New England Biolabs, NEB). PCRs were heated at 98˚C for 30 s followed by 30 cycles of 98˚C (10 s), 68˚C (30 s), and 72˚C (2.5 min for ALS800 and ALS4210, and 3.5 min for ALS4220). A final 10-min extension at 72˚C was performed. PCR products were separated on a 0.8% agarose gel in Tris Acetate EDTA buffer (TAE). Molecular sizes were calculated in silico using Gel Analyzer 2010 software (http://www.gelanalyzer.com/index.html) and either the GeneRuler 1 kb DNA ladder (Thermo Fisher Scientific) or 100 bp DNA ladder (NEB).

Quantitation of relative gene expression levels
Relative expression of the CoALS genes was determined by real-time reverse transcription (RT)-PCR starting from total RNA of C. orthopsilosis isolates. Each strain was inoculated in 10 ml of YPD and grown for 16 h at 30˚C with shaking. An aliquot (500 μl) of the pre-inoculum was then inoculated in 20 ml of fresh YPD broth and incubated for 1 h and 24 h at 30˚C. Total RNA was extracted using Nucleospin RNA (Macherey Nagel, Düren, Germany) according to manufacturer's instructions and treated with DNase (Macherey Nagel) to remove DNA contamination. RNA was eluted in 60 μl of RNase-free water and stored at -80˚C. The quality and quantity of the extracted RNA were determined spectrophotometrically in a UVette 220-1600 (10 mm path length, 100 μl of sample volume, Eppendorf, Milan, Italy). One μg of total RNA in a 20-μl reaction volume was converted into cDNA with random primers, using the Reverse Transcription System kit (Promega), following manufacturer's instructions. An RT-negative control was included to ensure lack of genomic DNA contamination. Primer sequences for real-time PCR are shown in Table 1. Each PCR mixture (20 μl) contained 1 μl of cDNA, 10 μl of Sso Advanced universal SYBR Green supermix, 1 μl each of primers (final concentration 0.2 μM) and 7 μl of sterile MilliQ water. Real-time PCR was performed in 96-well plates on a CFX96 Touch Real-Time PCR Detection System (BioRad) (95˚C incubation for 60 s, followed by 40 cycles of 95˚C incubation for 5 s and 58˚C for 15 s). C. orthopsilosis ACT1 was used as the reference gene (Table 1). The transcription level of ALS genes was calculated using the 2^-ΔCt method [29]. RT-PCR results were evaluated by Repeated Measures ANOVA test, followed by Dunnett's Multiple Comparison Test. A P value <0.05 was considered statistically significant.
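As a brief illustration of the 2^-ΔCt calculation, the sketch below takes ΔCt as the threshold cycle of the target gene minus that of the ACT1 reference. The Ct values are invented for the example and are not data from this study. The function name is hypothetical.

```python
def relative_expression(ct_target, ct_reference):
    """Return 2^-dCt, where dCt = Ct(target gene) - Ct(reference gene)."""
    delta_ct = ct_target - ct_reference
    return 2.0 ** (-delta_ct)


# Hypothetical Ct values: target gene at cycle 24, ACT1 at cycle 20.
# dCt = 4, so the target is estimated at 1/16 of the reference level.
print(relative_expression(24.0, 20.0))  # 0.0625
```

A smaller ΔCt therefore corresponds to higher expression relative to ACT1.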

Identification and DNA sequence of C. orthopsilosis ALS genes
The C. orthopsilosis strain 90-125 genome sequence initially was accessed using CGOB3 (http://cgob3.ucd.ie; [27,28]) and three putative ALS genes were located. Subsequently, data available at http://ncbi.nlm.nih.gov/genome/12421 were used to more carefully describe the ALS genes in the reference genome assembly (ASM31587v1). One ALS gene was located on chromosome 2 (CORT_0B00800) and two more in tandem on chromosome 3 (CORT_0C04210 and CORT_0C04220; Fig 1). For simplicity, the gene names were abbreviated here as CoALS800, CoALS4210, and CoALS4220, respectively.
In silico analysis revealed that the sequences of CoALS800 and CoALS4220 were incomplete due to mis-assembly of repeated DNA sequences in the coding region (Table 2).
The genome assembly was generated from short-read sequences (454 Life Sciences and Illumina) with the aid of paired-end Sanger sequence reads from a fosmid library [13]. Because fungal species tend to encode multiple ALS genes, each containing long stretches of repeated DNA, ALS genes are very difficult to assemble from short-read sequence data. The recent development of long-read DNA sequencing methodology provided the potential to produce sequence reads that span entire repeat regions. One drawback of the long-read technology is reduced accuracy of base calling [30], so Illumina data were also generated and incorporated into the genome assembly. The assembled genome was deposited in GenBank with the accession number PQBP00000000. The genome assembled into 10 contigs that mapped to the 8 chromosomal sequences defined by the reference genome assembly (ASM31587v1; Table 3). Long-read sequence data contributed to an improved assembly. For example, assembly ASM31587v1 had 242 contigs in 8 scaffolds, an N50 of 120 kb, and an L50 of 36. Assembly PQBP00000000 had no added Ns, an N50 of 1.59 Mb, and an L50 of 4.
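For readers less familiar with the assembly statistics quoted above, the following sketch computes N50 and L50 from a list of contig lengths, using the common definitions: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly, and L50 is the number of contigs in that set. The contig lengths in the example are made up and do not correspond to either assembly.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("no contig lengths supplied")


# Hypothetical contig lengths (arbitrary units); total = 290, half = 145.
print(n50_l50([80, 70, 50, 40, 30, 20]))  # (70, 2)
```

A higher N50 together with a lower L50, as reported for PQBP00000000, indicates that fewer and longer contigs account for half of the assembled sequence.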
The new genome assembly was searched using the Basic Local Alignment Search Tool (BLAST; https://blast.ncbi.nlm.nih.gov/Blast.cgi) with C. albicans Als3 (CaAls3) as the query (translated from GenBank accession number AY223552). BLAST results revealed the same three genes discussed above (CoALS4210, CoALS4220, CoALS800). Additional BLAST searches, using the CoALS sequences and other parts of known ALS genes as queries, failed to reveal additional genes, suggesting that strain 90-125 encoded three ALS genes. The schematic of the chromosomal arrangement of the C. orthopsilosis ALS genes (Fig 1) accurately depicts both genome assemblies. Final sequences for the CoALS genes were deposited in GenBank under accession numbers MG799557 (CoALS800, 2499 bp), MG799558 (CoALS4210, 2457 bp), and MG799559 (CoALS4220, 6078 bp).
The information above was the most concise description of the path toward identifying the CoALS genes and validating their DNA sequences. Prior to generating the new genome assembly, CoALS sequence assembly was attempted by subcloning and PCR amplification of various gene fragments, and Sanger sequencing of the resulting constructs and products. Other GenBank deposits of strain 90-125 sequences were made during the course of the study and are listed here for the sake of completeness. These included KJ679579 (which was identical to MG799557), KX961387 (a partial sequence including the 5' domain of CoALS4210, which was 100% identical to MG799558 in the region of overlap), and KY211672 (a partial CoALS4220 sequence, which was assembled using Xs to indicate unknown nucleotides within the tandem repeat region).
Comparisons between data from the different approaches suggested only minor differences. For example, CoALS4210 was predicted to be shorter in the ASM31587v1 assembly than the PQBP00000000 assembly. Validation methods pointed to the MG799558 sequence as the correct, final version. For CoALS4220, the long-read sequence technology provided an accurately sized template for assembly of the tandem repeat sequences. Sanger sequencing of subcloned fragments and PCR products in the different laboratories contributing to this manuscript were

Features of C. orthopsilosis Als proteins
C. orthopsilosis ALS genes were translated to visualize and compare the CoAls proteins (Fig 2). Protein features were compared to the well-characterized C. albicans proteins [2]. Each CoAls protein encoded a secretory signal sequence of 22 amino acids followed by an N-terminal (NT) domain of 312 or 313 amino acids. The CoAls NT domains were 81-87% identical, and shared 45-47% identity with NT-Als3 from C. albicans. Alignment of the NT-CoAls amino acid sequences with NT-Als3, for which the three-dimensional structure is known [7], showed conservation of the eight Cys that provide the NT-Als3 fold (Fig 3). This sequence similarity suggested conservation of adhesive function in the CoAls proteins. The NT domain was followed by a short sequence (AFR) that had amyloid-forming potential as defined by Tango [24]. The aggregative function of this sequence was demonstrated previously in C. albicans Als proteins [7,8]. Like CaAls3, the CoAls proteins had a Thr-rich region (T domain; 32-34% Thr) that followed the NT and AFR sequences. The boundaries of the T domain were based on evaluations of sequence data, rather than on functional data. Currently the T domain is bounded in C. albicans Als proteins by the end of the AFR and the start of the tandemly repeated sequences [7]. Of the newly described CoAls proteins, only CoAls4220 had tandemly repeated copies of a 36-aa sequence. Unlike C. albicans Als proteins, however, the length of selected repeat units in CoAls4220 was variable, with some repeat units lacking one or two amino acids. The region C-terminal to the tandem repeats in CoAls4220 was rich in Ser (30%) and Thr (15%), similar to C-terminal regions in C. albicans Als proteins.
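Residue composition values such as the Thr, Ser, or combined Ser/Thr/Pro percentages reported for individual regions can be computed with a few lines of code. This is a minimal sketch and not the authors' analysis procedure. The peptide used here is an invented 10-residue string and is not taken from any CoAls sequence. The function name is hypothetical.

```python
def residue_percent(region, residues):
    """Percentage of positions in `region` occupied by any residue in `residues`.

    `region` is a one-letter amino acid string; `residues` is a string of
    one-letter codes to count, e.g. "T" or "STP".
    """
    region = region.upper()
    if not region:
        raise ValueError("region is empty")
    hits = sum(region.count(residue) for residue in set(residues.upper()))
    return 100.0 * hits / len(region)


# Invented 10-residue peptide: 5 Thr, 2 Ser, 1 Pro.
peptide = "TSTPTTASTV"
print(residue_percent(peptide, "T"))    # 50.0
print(residue_percent(peptide, "STP"))  # 80.0
```

Applying the same function to a region delimited by domain boundaries gives the composition of that region alone, which is how a figure such as "32-34% Thr" for the T domain should be read.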
Regions following the T domain in CoAls800 and CoAls4210 were different than those observed in other Als proteins. Compositions of the two proteins were very similar in this region (Fig 2). Both encoded two different short, imperfect repeated sequences. The motif SSSEPP was found in the region proximal to the T domain. Following a Ser/Thr/Pro-rich (58-62%) region, a GSGN motif was present. Each CoAls protein had a C-terminal sequence with hallmarks of a GPI anchor addition site. The C-terminal 20 aa were predicted to be cleaved in this process.
Additional primers were designed to further dissect the location of the observed length variation (Fig 5). Strain and/or allelic variability was noted in the SSSEPP-encoding sequences of CoALS800 and CoALS4210. The GSGN-encoding sequences in these two genes were homogeneous in CoALS800, but variable in CoALS4210. Variability was observed in the 3' end of the CT-encoding domain of CoALS4220. These sequence differences suggest that mature CoAls proteins will be different sizes across strains, and that within a strain, alleles may produce proteins of different lengths.

Fig 3. The mature (signal peptide removed) NT domains of CoAls800, CoAls4210, and CoAls4220 contained 8 Cys residues in conserved positions (highlighted in yellow), which are essential for the folding of C. albicans (Ca) NT-Als3 for which the three-dimensional structure was solved [7]. Conservation of NT-Als3 adhesive function in the CoAls proteins was also suggested by the presence of the invariant Lys (K59) located in the CaNT-Als3 binding cavity (highlighted in blue). The amino acid alignment was produced using Clustal Omega (https://www.ebi.ac.uk/Tools/msa/clustalo). Identical (*), conserved (:), and semi-conserved (.) amino acids are indicated below the alignment. Dashes in the sequence indicate gaps. The sequence of C. albicans Als3 (CaAls3; GenBank accession number AY223552) was used as a reference. https://doi.org/10.1371/journal.pone.0215912.g003

Real-time PCR analysis of C. orthopsilosis ALS gene expression
Quantitative expression of CoALS genes was measured in the clinical isolates and reference strains grown in YPD medium for 1 h and 24 h. Data were displayed as a heat map (Fig 6). Transcription levels for the three CoALS genes varied based on stage of growth. CoALS800 showed the lowest expression level at 1 h incubation in all the strains tested (P < 0.0001). Conversely, CoALS4220 was expressed more highly than the other two genes (P < 0.0001 at 1 h, P < 0.001 at 24 h), although its transcriptional level was lower at 24 h compared to 1 h. Strain differences in expression were observed for all CoALS genes.

Discussion
The study of microbial pathogenesis has been revolutionized by the availability of genome sequences for many species. Although sequence data can be generated and assembled into draft genome files at a rapid pace, some genes have features that defy accurate representation in these resources. Examples include genes that belong to families with many similar loci, and open reading frames that encode multiple copies of repetitive sequences. Genes in the ALS family possess both features, and as such, are often mis-assembled in available genome sequences.

Fig 4. ... (6). Each subfigure shows a schematic of the CoALS gene or its encoded protein and corresponding PCR products that were analyzed on ethidium-bromide-stained agarose gels. Flanking gray rectangles represent the position of PCR primers outside of the coding region. (A). Overall size differences of the CoALS genes in each strain were demonstrated using primers 5' and 3' of the coding region (depicted as arrows; primer sequences are detailed in Table 1). Size markers (in kb) are indicated on the left of each gel image. Experiments used either GeneRuler 1 kb DNA ladder (Thermo Fisher Scientific) or 100 bp DNA ladder (NEB). Dissection of the source of the allelic variation in genes CoALS800 (B), CoALS4210 (C), and CoALS4220 (D) indicated variability in the sequences encoding the C-terminal regions of each protein and the tandem repeats in CoALS4220. Primers are labeled with lowercase letters that correspond to the labels on the agarose gel images. Sizes of fragments encoding the AFR and T domains were not detectably different between strains. https://doi.org/10.1371/journal.pone.0215912.g004

Long-read DNA sequence methodology is one answer to this problem. Although long-read methods produce data with a lower accuracy in base calling [30], the method is attractive for studying the ALS family since the long-read sequence can provide a template upon which shorter-read data (i.e. Illumina) can be assembled. Work presented here demonstrates the utility of this approach. The combination of methods provided an accurate and complete assembly for two of the three CoALS genes. Data for the third gene were sufficiently complete that primers could be designed for PCR amplification and Sanger sequencing of the product, delivering a final gene sequence. Overall, combining long- and short-read approaches generated a more complete picture of the ALS family than was evident in the previous genome sequence that was assembled without the benefit of the long-read data.

Fig 5. ... Table 1. Molecular sizes (in kb) are shown at the left of each gel image. GeneRuler 1 kb DNA ladder (Thermo Fisher Scientific) was used in all experiments. Variability in the CT region of CoAls800 was located in the SSSEPP region (A), while in CoAls4210, the GSGN region was also variable in size. The CT region of CoAls4220 was also variable in size, due to sequence differences that encode the 3' half of the CT region (C). https://doi.org/10.1371/journal.pone.0215912.g005

Fig 6. Strain-and growth-stage differences in CoALS gene expression.
Real-time RT-PCR was used to quantify relative expression levels for the three CoALS genes in the six C. orthopsilosis strains grown for either 1 h or 24 h at 30˚C in YPD liquid medium. Lower numbers indicate a smaller difference between expression of the gene and the ACT1 control, suggesting higher overall relative expression. Gray-scale coding indicates higher (darker gray) and lower expression (lighter shading). CoALS800 showed the lowest expression level at 1 h incubation in all the strains tested (P < 0.0001). CoALS4220 was expressed more highly than the other genes (P < 0.0001 at 1 h, P < 0.001 at 24 h), although its transcription level was lower at 24 h compared to 1 h. https://doi.org/10.1371/journal.pone.0215912.g006

Among the newly characterized CoALS genes, CoALS4220 looks most like the ALS genes that were described in C. albicans because of the presence of multiple copies of a tandemly repeated sequence in the center of the gene. In CoALS4220, however, this sequence includes repeat copies that are missing 1 or 2 amino acids, a feature that was not observed in any of the C. albicans proteins. The CoALS800 and CoALS4210 genes are unique among currently characterized ALS genes because they do not possess a tandemly repeated sequence and, therefore, encode proteins that are shorter than most Als proteins for which adhesive function has been demonstrated. For example, C. albicans Als3 is produced from two alleles in strain SC5314: one protein is 1155 amino acids and the other is 1047 amino acids, due to the presence of three fewer copies of the tandemly repeated sequence. The shorter protein contributes less than the larger protein to C. albicans adhesion, presumably because the longer protein is better able to project the NT-Als adhesive domain away from the C. albicans cell surface [33]. These CaAls3 sequences are over 300 amino acids longer than the CoAls800 and CoAls4210 sequences described here.
However, recent work shows that CoALS4210 contributes to C. orthopsilosis adhesion because deletion of the gene results in reduced adhesion to human buccal epithelial cells (HBECs) [34].
The current study is also unique in that it examines ALS gene expression in multiple clinical isolates using a quantitative method and demonstrates notable strain-specific gene expression patterns. Overall, C. orthopsilosis shows differential regulation of its ALS genes by growth stage, a theme that was also found in C. albicans. C. albicans ALS4 is up-regulated in cells from a saturated culture [35] whereas C. albicans ALS1 is highly expressed in cells that are transferred to fresh growth medium [36]. In C. orthopsilosis, CoALS800 was more highly expressed as a culture aged, while CoALS4220 was more highly expressed in a 1-h culture. CoALS4210 expression patterns varied by strain, with some strains showing higher relative expression in a young culture and others exhibiting little gene expression difference regardless of which growth stage was examined.
Future studies will be aimed at associating gene expression data with adhesive function in different experimental models. Strain 124, previously described as highly adhesive to exfoliated buccal cells [11], shows strong relative expression of CoALS4220 (Fig 6), which could be responsible for its higher relative adhesion [11]. A similar relationship has been demonstrated in C. albicans, in which ALS gene expression levels are positively correlated with Als protein abundance.
C. orthopsilosis is closely related to Candida parapsilosis; it has been less than 15 years since the species were recognized as distinct [16]. Publications describe C. parapsilosis ALS gene content as highly divergent by strain. Pryszcz et al. [14] examined whole genome sequences from a variety of C. parapsilosis isolates and noted one strain with 5 genes, while others encoded only 1. In our current study, PCR primers designed to recognize CoALS800, CoALS4210 and CoALS4220 amplified the predicted products from each of 6 C. orthopsilosis isolates. Amplified Fragment Length Polymorphism analysis of four of the isolates (ATCC 96139, 85, 124, and 331) was reported previously [37]. UPGMA analysis showed that these strains belong to different clusters, indicating genetic diversity among the isolates used in the current study. These observations suggest broad conservation of the three CoALS genes within the species. Recently, it has been shown that the majority of C. orthopsilosis strains are hybrids between a Parental Species A (non-hybrid, of which the homozygous isolate 90-125 is representative) and a Parental Species B, which has not been isolated in non-hybrid form [38,39]. Interestingly, three of the 4 clinical isolates used in this study are known to be hybrids belonging to different clades, namely strains Co85 (Clade 1), Co331 (Clade 2), and Co124 (Clade 4.1) [39]. It has been suggested that C. orthopsilosis and C. metapsilosis hybrid formation may have facilitated a change in pathogenicity to humans [39,40]. Further analyses will be required to investigate potential association between adhesion ability and hybrid genomes.
Although Co85, Co124, Co331 and 90-125 all have diploid genomes, evaluation of copy number variation in 1-kb windows across the genomes revealed the presence of a single copy of CoALS4210 in strain Co85 [39]. This result did not seem to affect ALS mRNA levels detected by RT-PCR in strain Co85. We can conclude with certainty that strain 90-125 does not have additional ALS genes, but we cannot exclude the presence of other ALS genes in the remaining strains. Ongoing work in C. orthopsilosis will continue to characterize the ALS family and adhesive function of the Als proteins in this species.