A family of silicon transporter structural genes in a pennate diatom Synedra ulna subsp. danica (Kütz.) Skabitsch

Silicon transporters (SITs) are proteins that capture silicic acid in the aquatic environment and direct it across the plasmalemma to the cytoplasm of diatoms. Diatoms utilize silicic acid to build species-specific ornamented exoskeletons and make a significant contribution to the global silica cycle, estimated at 240 ± 40 Tmol a year. Recently, the SaSIT genes of the freshwater araphid pennate diatom Synedra acus subsp. radians were found to be present in the genome as a cluster of two structural genes (SaSIT-TD and SaSIT-TRI), each encoding several concatenated copies of a SIT protein. The products of these structural genes could potentially be processed into “mature” SIT proteins by means of posttranslational proteolytic cleavage. In the present study, we discovered three similar structural SuSIT genes in the genome of a closely related freshwater diatom, Synedra ulna subsp. danica. Structural gene SuSIT1 is identical to structural gene SuSIT2, and the two are connected by a non-coding DNA sequence. All the putative “mature” SITs contain conserved amino acid motifs, which are believed to be important in silicon transport. The data obtained suggest that the predicted “mature” SIT proteins may be the minimal units necessary for the transport of silicon in S. ulna subsp. danica. The comparative analysis of all available multi-SITs has allowed us to detect two conserved motifs, YQXDXVYL and DXDID, located between the “mature” proteins. The aspartic acid-rich DXDID motif can, in our opinion, serve as a proteolysis site during multi-SIT cleavage. The narrow distribution of the distances between the CMLD and DXDID motifs can serve as additional evidence for the conservation of their function.


Introduction
Silicon is the second most abundant element on the planet after oxygen, amounting to 28% of Earth's crust by mass [1]. In water environments it is available in the form of dissolved silicic acid (H4SiO4) [2], mostly delivered to water bodies by river inflow, direct atmospheric deposition and/or biogenic silica dissolution. Essentially, via diatom biomineralisation and on geological timescales via continental weathering [3]. The silica cycle was shown to be tightly bound to that of carbon [4,5]. Silicic acid is an important biogenic element that a variety of organisms use to build different siliceous structures or to strengthen their cell walls, including diatoms, chrysophytes, metazoans, silicoflagellates, radiolarians, choanoflagellates, embryophytes and other organisms [6][7][8].
In particular, diatoms (Bacillariophyta) are one of the most numerous and ecologically important groups of phytoplankton. Using silicic acid, they form ornate species-specific exoskeletons [9], globally assimilate 240 ± 40 Tmol of silicic acid a year [10] and provide up to 20% of the Earth's primary organic carbon production [11].
Critical for biosilicification is the process of capture and transport of silicic acid from the aquatic environment through the plasmalemma. Silicic acid is transported into the cell by SITs (silicon transporters) against a concentration gradient. The molecular mechanism of such transport remains unclear.
Besides the regular SIT genes described above, some genomes were found to contain multi-SIT genes, which encode several SIT proteins merged head-to-tail within a single reading frame. To date, 46 such genes have been identified. They are present in marine and freshwater diatoms of all three classes (centric, multipolar centric and pennate), as well as in a single ciliate species [25,26,36].
The phylogenetic analysis of the putative "mature" SIT proteins in multi-SITs of marine diatoms (i.e. the sequences produced by cleaving them into fragments, each with 10 transmembrane domains and all the conserved sites) has shown that they all belong to clade B, a paraphyletic basal clade [25,26]. According to the authors' hypothesis, this clade emerged when the silica concentration in the ocean was higher than it is now. In modern diatoms, clade B SITs are important under elevated silicic acid concentrations [25].
Two such genes, located in a single chromosome at the distance of 4 kbp from each other, were recently found in a freshwater araphid pennate diatom Synedra acus subsp. radians from Lake Baikal [36].
In this work we have sequenced three multi-SIT genes, as well as the 5′- and 3′-ends of their mRNAs, in a freshwater araphid pennate diatom, Synedra ulna subsp. danica. The comparative analysis of the multi-SIT sequences from a range of diatoms has allowed us to detect aspartate-rich conserved motifs (DXDID) which, in our opinion, can serve as the proteolysis sites necessary for processing a multi-SIT into multiple "mature" SIT proteins, each capable of transporting silicic acid. Multi-SITs in the genus Synedra, like other multi-SITs, belong to clade B and have arisen from a single non-multiplicate ancestor through a series of duplications.
We used a non-axenic culture of S. ulna subsp. danica (Kütz.) Skabitsch. grown from a single cell sampled in Lake Baikal. The cells were cultivated for four weeks at 16˚C with intermittent mixing under a natural day-night biorhythm in 20-L glass bottles filled with Diatom Medium (DM) [42].
The cells were then collected on polycarbonate filters (5-μm pores) (Whatman, USA), briefly rinsed with cold DM medium, harvested by centrifugation for 2 min at 16,100 g and 4˚C and then stored at −70˚C.
High-molecular-weight genomic DNA was isolated according to a modification of the method of Jacobs et al. [43] (protocol available at dx.doi.org/10.17504/protocols.io.qh6dt9e).

Search for SIT structural genes in the genome of S. ulna subsp. danica
We developed bidirectional degenerate primers (UniCMLDF-UniCMLDR) (S1 Table) using comparative analysis of the SIT structural genes cluster (SaSIT-TD, SaSIT-TRI) in S. acus subsp. radians (GenBank accession no. KX345281) and the nucleotide sequences encoding the conserved CMLD motifs and their context in P.-n. multiseries SIT (JGI protein ID: 338018), Achnanthes exigua SIT (GenBank accession no. EF530636), Epithemia zebra SIT (GenBank accession no. EF065521) and P. tricornutum SIT (GenBank accession no. EU879096) structural genes. PCR products were cloned in the pJET 2.1 vector (Thermo Scientific, USA), and the inserts were sequenced following plasmid DNA isolation with the GeneJET Plasmid DNA Purification Kit (Thermo Scientific, USA). The primer pairs 1100F-739R and 3403F-276R were used to identify a link amongst the PCR fragments deciphered (S1 Table).
The primer pairs 1657F-92R, 9288F-5561R, 1233F-862R and 3531F-172R (S1 Table) were used to determine the location of the S. ulna subsp. danica SIT structural genes relative to each other. The amplicons were extracted from the reaction mixtures by sorption on the AMPure XP magnetic particles (Agencourt, USA), then directly sequenced by Sanger technique with BigDye 3.1 reagent (Applied Biosystems, USA) and finally analysed with 3130XL Analyzer (Applied Biosystems, USA).

Isolation of total RNA
An RNeasy Plant Mini Kit (QIAGEN, Germany) was used to isolate total RNA. The quality of the RNA isolated was assessed by means of an Agilent 2100 Bioanalyser with the RNA 6000 Nano LabChip Kit (Agilent Technologies, USA). Products with an RIN index equal to or higher than eight were used for Step-Out RACE-PCR.

Sequencing of the 5′- and 3′-termini in the mRNA of the SuSIT structural genes in S. ulna subsp. danica
A Mint RACE cDNA Amplification Kit (Evrogen, Russia) was used to sequence the 5′- and 3′-termini in the mRNA of SuSIT structural genes in S. ulna subsp. danica, according to the manufacturer's instructions.
Several rounds of 5′- and 3′-Step-Out RACE PCR of the SuSIT structural genes of S. ulna subsp. danica were performed with different pairs of primers (S2 Table). After each round, the PCR products were analysed by electrophoresis in 1% agarose/ethidium bromide gel in TAE buffer, cloned and sequenced.

Bioinformatic analysis of nucleotide sequences
Vector NTI 11 Academic software (Invitrogen, USA) was used to analyse the nucleotide sequences of the PCR products and the cDNA. Alignment of the amino acid and nucleotide sequences was performed with the CLUSTALW algorithm using the BioEdit 7.2.5 [44] and Vector NTI 11 software. TmPred [45] was used to identify transmembrane domains. Pfam [46] was used to identify boundaries between the putative "mature" SIT proteins in the predicted amino acid sequences. The putative "mature" SIT proteins had approximately the same length of 440 amino acids, over which the hidden Markov model of the domain PF03842 could be mapped.
The maximum likelihood trees of the SIT amino acid sequences were built using RAxML 8.0 [47] under the LG substitution model [48] and gamma rate distribution (gamma parameter = 4). This model was selected using the Bayesian information criterion via the "find best DNA/protein model" option in the MEGA 6.0.6 software package [49]. One thousand bootstrap trees [50] were built to estimate the node support.

Identification of the SIT structural genes in S. ulna subsp. danica
To develop SIT structural gene-specific primers, we compared and aligned the nucleotide sequences encoding the CMLD motif of the SIT protein and some amino acids surrounding it (CMLDFIN) for S. acus subsp. radians SIT, P.-n. multiseries SIT-TRI, A. exigua SIT, E. zebra SIT and P. tricornutum SIT (Fig 1). As a forward primer (UniCMLD F), we chose the sequence of the sense DNA strand encoding CMLD and its context, taking into account codon degeneracy and the conservation of nucleotide positions. In a similar fashion, we took the complementary antisense sequence of this site as a reverse primer (UniCMLD R). PCR with these primers and genomic DNA of S. ulna subsp. danica gave a mix of amplification products, all approximately 1400 nucleotides long, which we separated by cloning. The individual amplification products within the plasmid DNA were isolated, sequenced and compared with the SIT structural genes of S. acus subsp. radians. Multiple alignment revealed that the cloned DNA fell into four groups. For each of these groups, we developed unique primers directed towards the CMLD motif. These primers were used to amplify the DNA. This time, the PCR fragments obtained were sequenced directly without cloning. RACE performed on the total RNA of S. ulna subsp. danica with the SIT structural gene-specific primers developed enabled us to identify the 5′- and 3′-termini of the corresponding SIT mRNA and to reconstruct the sequences of the SIT structural genes. The sequences obtained were verified by direct sequencing of the genomic DNA amplification products (Fig 1). By this approach, we found that SIT structural genes are present in the S. ulna subsp. danica genome. We named them SuSIT1/SuSIT2 (GenBank ID MF971079) and SuSIT3 (GenBank ID MF971078) (S1 File); each of them encodes a polypeptide chain containing three CMLD motifs.
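The primer design step described above (a degenerate forward primer covering the aligned CMLD-coding region, and its antisense counterpart as the reverse primer) can be illustrated with a minimal Python sketch. It is not the authors' procedure: the three input sequences are hypothetical 12-nucleotide stretches that merely encode CMLD with different codons, and the function names are ours. Each alignment column is collapsed into the standard IUPAC ambiguity code, and the reverse primer is obtained as the reverse complement.

```python
# IUPAC nucleotide ambiguity codes, keyed by the set of bases seen in a column.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

# Complement of every IUPAC code (e.g. R = A/G pairs with Y = T/C).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def degenerate_consensus(aligned):
    """Collapse gap-free, equal-length aligned DNA sequences into one degenerate sequence."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return "".join(IUPAC[frozenset(column)] for column in zip(*aligned))


def reverse_complement(primer):
    """Return the antisense (reverse-complement) version of a degenerate primer."""
    return primer.translate(COMPLEMENT)[::-1]


def degeneracy(primer):
    """Number of distinct plain sequences represented by a degenerate primer."""
    options = {code: len(bases) for bases, code in IUPAC.items()}
    total = 1
    for code in primer:
        total *= options[code]
    return total


if __name__ == "__main__":
    # Hypothetical sequences, each encoding Cys-Met-Leu-Asp with different codons.
    cmld_sites = ["TGTATGTTGGAT", "TGCATGCTTGAC", "TGTATGCTCGAT"]
    forward = degenerate_consensus(cmld_sites)
    print(forward)                      # TGYATGYTBGAY
    print(reverse_complement(forward))  # RTCVARCATRCA
    print(degeneracy(forward))          # 24
```

The degeneracy count is a quick way to see how many individual oligonucleotides a degenerate primer stands for; real primer design would also weigh melting temperature and 3′-end stability, which this sketch ignores.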
In order to determine the relative position of the SuSIT structural genes in S. ulna subsp. danica, we used a combination of terminal primers directed outwards from the gene. PCR products from the interior of the gene cannot be amplified under such conditions, whilst DNA fragments spanning the region between the SuSIT structural genes are amplified. We found that the genome of S. ulna subsp. danica had three rather than two structural genes encoding long proteins. Each of them contained three CMLD motifs, and two identical genes (SuSIT1 and SuSIT2) were connected by a non-coding DNA sequence 5239 bp long (Fig 2). The location of the SuSIT3 structural gene relative to SuSIT1 and SuSIT2 remained unknown.
Schematics of the S. ulna subsp. danica SIT genes are shown in Fig 2. The nucleotide sequence of the SuSIT1 gene suggests that it can produce a long amino acid sequence, which can be divided into three smaller fragments mapping to the SIT HMM. Each of these fragments contains a single CMLD motif, as well as four GXQ motifs, and is between 436 and 446 amino acid residues in length. SIT proteins with one conserved CMLD motif (or its analogue) and four GXQ motifs are thought to transport silicon [14][15][16]. All these fragments also contain six out of seven conserved amino acids: Q104, N115, H190, Y193, S229 and S372 (positions relative to the PtSIT1 sequence) (S2 File) [51].
Hence, we may propose a hypothesis that each of the three fragments of the SuSIT1 gene product could be a complete SIT protein able to perform its function, and that SuSIT1 product is posttranslationally cleaved to produce these "mature" proteins. We named these putative "mature" proteins SuSIT1A, SuSIT1B and SuSIT1C, as shown in Fig 2. SuSIT1A and SuSIT1B are highly similar to each other, but relatively distant from SuSIT1C (S3 Table). It is possible that this difference in structure reflects differences in their functions. The sequence, and therefore the structure, of SuSIT2 is identical to that of SuSIT1.
The SuSIT3 structural gene is also transcribed in a single reading frame and can produce a predicted protein 1416 amino acids long. Three putative "mature" proteins (SuSIT3A, SuSIT3B and SuSIT3C) 436-446 amino acids long are present in the predicted amino acid sequence.
Having deciphered the SuSIT structural genes and identified nine predicted "mature" proteins, we compared them to each other. The conserved CMLD motifs appear to be present in all the "mature" proteins from both SuSIT1/2 and SuSIT3. The same was also true for the conserved GXQ motifs.
The TmPred software showed that each of these "mature" proteins is an integral membrane protein and contains 10 transmembrane domains. The CMLD motif is located at the edge of a hydrophobic region (Fig 3). These data support an argument in favour of the hypothesis that the predicted 436-446-amino-acid-long fragments of S. ulna subsp. danica SITs are "mature" proteins able to fulfil their main function of silicon transport. Identity and similarity data for the predicted SaSIT and SuSIT proteins are given in S3 Table. Forty-eight full-length predicted amino acid sequences of multi-SIT proteins from 17 diatom species have been published to date [25,26,36]. Comparing their sequences, we found conserved motifs with the general formulae YQXDXVYL and DXDID (Fig 3 and S3 File). The latter motif is always located at the boundary of the putative "mature" SIT proteins and absent from the triplicate SIT termini. Thus, it could be a target for the aspartic-acid-specific protease responsible for the posttranslational cleavage of these multi-SITs into "mature" proteins.
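To make the proposed role of DXDID as a boundary marker concrete, the sketch below locates both motifs in a sequence with regular expressions and splits the sequence after each DXDID. This is a toy illustration, not the analysis used in this study: the exact cleavage position within or near the motif is unknown, so cutting immediately after the motif is purely our assumption, and the example sequence is an invented 54-residue string.

```python
import re

# X in the published motif formulae stands for any amino acid.
DXDID = re.compile(r"D.DID")
YQXDXVYL = re.compile(r"YQ.D.VYL")


def motif_starts(pattern, protein):
    """0-based start positions of all non-overlapping matches of a motif."""
    return [match.start() for match in pattern.finditer(protein)]


def split_after_dxdid(protein):
    """Split a multi-SIT-like sequence right after every DXDID motif.

    Assumption: the cut is placed immediately after the motif; the real
    cleavage position, if cleavage occurs at all, is not known.
    """
    pieces, previous_cut = [], 0
    for match in DXDID.finditer(protein):
        pieces.append(protein[previous_cut:match.end()])
        previous_cut = match.end()
    pieces.append(protein[previous_cut:])  # last unit has no trailing DXDID
    return pieces


if __name__ == "__main__":
    # Invented toy "triplicate": three units, each with one CMLD motif.
    toy_multi_sit = (
        "MAAACMLDKK" + "YQSDAVYL" + "DEDID"
        + "GGGCMLDPP" + "YQTDRVYL" + "DADID"
        + "TTTCMLDEE"
    )
    print(motif_starts(DXDID, toy_multi_sit))     # [18, 40]
    print(motif_starts(YQXDXVYL, toy_multi_sit))  # [10, 32]
    for unit in split_after_dxdid(toy_multi_sit):
        print(unit, unit.count("CMLD"))
```

In the toy sequence, as in the multi-SITs described here, the final unit carries no DXDID motif at its terminus, so the split yields one more unit than there are motifs.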
Earlier western blotting experiments on the S. acus subsp. radians total protein detected a polypeptide with a molecular weight of approximately 64 kDa that binds antibodies to synthetic peptides derived from the SIT amino acid sequence [52]. There is no evidence of the existence of proteins with molecular weights of 94 or 152 kDa interacting with these antibodies. Antigens of the antibodies to another SIT-derived peptide in the proteomes of several other diatoms (Thalassiosira pseudonana, Skeletonema costatum, Chaetoceros gracilis, Cyclotella meneghiniana, Cylindrotheca fusiformis, Ditylum brightwellii, Nitzschia pelliculoza, Phaeodactylum tricornutum) had weights of 55-64 kDa [53]. The DXDID motifs of multi-SITs are not flanked by proline residues, which could potentially bend the protein secondary structure and thus prevent caspase binding. This fact can serve as an additional line of evidence for their role as the proteolysis site. Support for this hypothesis also comes from the fact that the distances between the CMLD and DXDID motifs vary in a very narrow range for most diatoms (Fig 4). This suggests that their relative location in the three-dimensional structure of the protein is conserved as well, and thus that these motifs have a conserved function in all the multi-SITs.
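The distance measurement behind this argument can be sketched in a few lines of Python. The code below is a hypothetical illustration, not the procedure used for Fig 4: we assume the distance is counted in residues from the start of each CMLD motif to the start of the nearest downstream DXDID motif, and the input is an invented toy sequence.

```python
import re


def cmld_to_dxdid_distances(protein):
    """Distances (in residues) from each CMLD start to the next DXDID start.

    Assumption: distance is measured start-to-start. CMLD motifs with no
    downstream DXDID (e.g. in the last unit) are skipped.
    """
    cmld_starts = [m.start() for m in re.finditer(r"CMLD", protein)]
    dxdid_starts = [m.start() for m in re.finditer(r"D.DID", protein)]
    distances = []
    for cmld in cmld_starts:
        downstream = [d for d in dxdid_starts if d > cmld]
        if downstream:
            distances.append(min(downstream) - cmld)
    return distances


if __name__ == "__main__":
    # Invented toy sequence with three CMLD motifs and two DXDID motifs.
    toy_multi_sit = (
        "MAAACMLDKK" + "YQSDAVYL" + "DEDID"
        + "GGGCMLDPPP" + "YQTDRVYL" + "DADID"
        + "TTTCMLDEE"
    )
    distances = cmld_to_dxdid_distances(toy_multi_sit)
    print(distances)                         # [14, 15]
    print(max(distances) - min(distances))   # 1
```

Collecting such distances across many multi-SITs and inspecting their spread is one simple way to quantify how narrow the distribution is.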
On the other hand, we cannot exclude the possibility that the triplicate proteins are functional without being processed. Another conserved motif, YQXDXVYL, is present in all "mature" S. ulna subsp. danica SIT proteins. The function of this motif is currently unknown.
Aspartic acid-rich sites are well known in the literature as targets for the caspase family of proteases [54]. Although they are mostly known for their role in cell death processes, proteins of this family were shown to be expressed constitutively in Thalassiosira pseudonana [55]. It is possible that they take part in the processing of multi-SITs or other similar multiplicated proteins.
Durkin et al. examined the phylogeny of approximately 400 SITs deciphered before 2016 from more than 100 diatom species and from a small number of other silicifying organisms [26]. Many predicted SIT proteins appeared to consist of 2-3 merged monomeric proteins. Each of these monomers contains one conserved CMLD motif and four conserved GXQ motifs, as do the "mature" proteins within the SaSIT and SuSIT proteins (S2 File). In S2 File, we present amino acid sequences of the SITs that were split in silico using Pfam software and aligned in order to measure their identity and similarity. The file includes data on S. ulna subsp. danica SITs from the present study, as well as data on S. acus subsp. radians SITs that were published previously [36]. The sequences of 87 predicted "mature" SIT proteins are present in S2 File. The CMLD motif was absent from only three proteins (two out of seven Extubocellulus spinifer proteins and one out of six Leptocylindrus danicus proteins). Many individual amino acid residues occupy identical positions in these proteins. For example, the G residue occurs in positions 79, 192, 228, 363, 451 and 497, the absolutely conserved W residue occurs in position 466, whilst the L residue is present in positions 67, 119, 194 and 459. The phylogenetic distances between the diatoms from S2 File are high. This could indicate that some homologous amino acid residues did not change during evolution. These conserved amino acid residues are likely to participate either in the correct folding of tertiary structures or directly in silicon transport.
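Positions such as those listed above can be read off a multiple alignment programmatically. The sketch below shows one minimal way to do so in Python; it is not the software used in this study, the toy alignment is invented, and the identity measure (matches divided by the number of columns where neither sequence has a gap) is only one of several conventions in use.

```python
def conserved_columns(aligned, gap="-"):
    """Return (1-based position, residue) for columns identical in all sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    conserved = []
    for position, column in enumerate(zip(*aligned), start=1):
        residues = set(column)
        if len(residues) == 1 and gap not in residues:
            conserved.append((position, column[0]))
    return conserved


def percent_identity(seq_a, seq_b, gap="-"):
    """Pairwise identity over columns where neither sequence has a gap.

    Assumption: gapped columns are excluded from the denominator.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Invented toy alignment of three short peptides.
    toy_alignment = ["MGLW-AG", "MGIWKAG", "MGLWRSG"]
    print(conserved_columns(toy_alignment))
    # [(1, 'M'), (2, 'G'), (4, 'W'), (7, 'G')]
    print(round(percent_identity(toy_alignment[0], toy_alignment[1]), 1))
```

Positions reported this way refer to alignment columns, so they match residue numbers in an individual protein only where no gaps precede them.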
We found that forty-three complete multi-SIT proteins had the same predicted targets of processing protease (DXDID) as SuSIT and SaSIT do at the boundaries between "monomeric" SITs. Their locations relative to each other are somewhat different, and they are absent from the S3 File.
It is possible that the highly conserved sequences of the predicted "mature" SIT proteins form an essential structure that performs silicon transport in diatoms. A more detailed study of them could shed new light on the silicon transport mechanism.
Since the predicted "mature" SIT proteins within the triplicate and duplicate protein-coding genes of the two Synedra species have high sequence identity to each other, we assume that they emerged in a relatively recent series of duplication events. The duplication that separated the A/B and C proteins probably predates that between proteins A and B, as evidenced by their higher divergence.
To elucidate the evolutionary history behind the different SIT proteins of the two closely related Synedra species, we performed a phylogenetic analysis. To place our sequences into a broader context, their "mature" proteins were aligned to the SITs from the earlier work [26], and a maximum likelihood tree was built. This tree (S1 Fig) was similar in topology to those in earlier work [25,26] and included all five clades. Like the "mature" SIT proteins from the multi-SITs of marine diatoms, the Synedra sequences belonged to clade B. They formed a single clade within it, which means that they have a single non-duplicate ancestor not shared with other sequenced SITs (Fig 5 and S1 Fig). This ancestral gene was tandemly duplicated in the genome of the two species' common ancestor, creating a structure not unlike SaSIT-TD. Its 5′ region was duplicated further, forming a triplicate ancestor of all modern Synedra multi-SITs. This ancestor was duplicated again, though this time it formed two separate paralogs instead of a single multidomain gene. During the species divergence, S. ulna subsp. danica retained both paralogs, one of which gave rise to SuSIT1 and SuSIT2, while the other became SuSIT3. S. acus subsp. radians, on the other hand, lost one of the ancestral paralogs but duplicated the other one (together with the flanking sequences). This duplication explains the neighbouring positions of SaSIT-TRI and SaSIT-TD. Finally, the 5′-paralog lost its SIT1A and SIT1B part in a deletion, thus becoming SaSIT-TD (Fig 5).
Of special interest is the absolute identity of the nucleotide sequences of the SuSIT1 and SuSIT2 structural genes. Such identity could be a result of a very recent duplication or, in particular, of non-crossover gene conversion [56]. Gene conversion is the process by which a gene replaces a homologous gene such that the genes become identical after the conversion event [57]. The mechanism is not yet fully understood. In a recent review, Korunes and Noor point out that "gene conversion has major evolutionary consequences, ranging from immediate effects on nucleotide diversity to long-term consequences that shape genome evolution, species formation, and species persistence" [56].
To our knowledge, non-crossover gene conversion has not yet been found in diatom genomes.
Supporting information S1