Marine diatoms change their gene expression profile when exposed to microscale turbulence under nutrient replete conditions

Diatoms are a fundamental microalgal phylum that thrives in turbulent environments. Despite several experimental and numerical studies, if and how diatoms may profit from turbulence is still an open question. One of the leading arguments is that turbulence favours nutrient uptake. Morphological features, such as the absence of flagella, the presence of a rigid exoskeleton and the micrometre size would support the possible passive but beneficial role of turbulence on diatoms. We demonstrate that in fact diatoms actively respond to turbulence in non-limiting nutrient conditions. TURBOGEN, a prototypic instrument to generate natural levels of microscale turbulence, was used to expose diatoms to the mechanical stimulus. Differential expression analyses, coupled with microscopy inspections, enabled us to study the morphological and transcriptional response of Chaetoceros decipiens to turbulence. Our target species responds to turbulence by activating energy storage pathways like fatty acid biosynthesis and by modifying its cell chain spectrum. Two other ecologically important species were examined and the occurrence of a morphological response was confirmed. These results challenge the view of phytoplankton as unsophisticated passive organisms.

built with log2-transformed cell density values between every time point both in single cylinders and on the mean values. The maximum division rate was calculated on the steepest portion of the growth curve. For C. decipiens the number of separating chains (chains presenting at least one separation point identified by two adjacent separating cells within the chain itself, see Fig. 3, inlet) was recorded as well. The chain length of the separated subchains was recorded and compared among replicates and between treatments and controls. The number of mechanically broken chains was recorded as well. Broken chains can be recognised because at one or both apices they lack the terminal cells, identifiable by the presence of terminal setae (see main text and Fig. 2, inlet). Statistical relevance of results was estimated by applying Wilcoxon and Kolmogorov-Smirnov two-sample (KS2) tests (alpha 0.001 i.e. 99.9% accuracy) in MatLab. If the null hypothesis (samples are identical) can be rejected, the test result equals 1, otherwise the outcome equals 0. Both tests were carried out between replicates (turbulent vs. turbulent or still vs. still) in order to verify congruence of replicates and between experimental and control samples (turbulent vs. still). All possible pairwise combinations were tested.
Here Jaccard clip was used to mitigate false fusion of adjacent transcripts 6 . The assembling resulted in 27,923 contigs with a N50 value of 1,912 bp and average contigs length 1,211 bp. We chose to filter out transcripts showing no relevant expression. To this aim, we quantified transcript expression levels by mapping reads against the assembled transcriptome using Bowtie (ver. 1.1) with parameters: --p 20 --chunkmbs 10240 --maxins 500 --seedlen 20 -tryhard -a -S 7 . To count the reads mapped we used the idxstats program in Samtools (ver. 0.1. 19-44428cd). Only transcripts with count per million (CPM) greater than one for at least 2 out of 8 samples were considered expressed beyond background noise and retained.

Refinement of domain annotation by Meta-CLADE.
The importance of a fine domain annotation is crucial to estimate the abundance of certain functional classes of domains within the community and the absence of others, and to highlight enzymatic biochemical functions associated to specific environmental substrates. Probabilistic models generated from the consensus of a set of homologous sequences are often used as representations of protein domains. When sequences are well conserved, the resulting probabilistic model captures the most conserved features in the domain sequences and these latter can be successfully used to find new domains in databases of sequences, sharing the same features of the original sequence. However, when sequences have highly diverged, consensus signals become too weak to generate a useful probabilistic representation, and models constructed by global consensus do not properly characterize domain features. To overcome this fundamental bottleneck, we introduced CLADE-centered models 19 , defined by considering a set of homologous sequences, and we constructed a model for each sequence Si from a set of conserved sequences that are similar to Si. By so doing, the probabilistic models generated to represent a domain, are displaying features that are characteristic of each sequence Si, and they might be very different from Si to Sj. More divergent the homologous domain sequences Si and Sj are, more the models constructed from these sequences are expected to display different features. It is therefore important for a domain to be represented by several models that can characterise its different paths of evolution within different clades. The multi-source representation has been demonstrated to be more efficient than a single source strategy for domain identification in Bernardes et al. 19 . Based on multiple models associated to each domain, we constructed a large probabilistic model library for all Pfam domains. For each domain, the library includes the Pfam consensus models and about 350 clade-centered models. This amounts to more than 2.5 million probabilistic models. Such models have been previously used in CLADE 19 , a tool for domain identification designed for improving genome annotation. META-CLADE (Ugarte et al, in preparation) has been designed based on CLADE with the purpose to annotate metagenomics and metatranscriptomics reads. Due to the read features described above, these data demand a special treatment and a specific algorithmic design. META-CLADE is based on two main steps. The META-CLADE first step takes as input a set of metaG/metaT reads or contigs where CDS/ORFs have been previously annotated, together with all probabilistic CLADE-centered models and Pfam pHMM models. For each sequence, the CDS region is scanned with the probabilistic models and all domain hits are identified. For each hit, two scores are assigned, the bit-score for the entire hit and the mean-bit-score for each hit residue (obtained as the bit-score of the hit divided by the length of the hit). The META-CLADE second step estimates domain specific thresholds for bit-scores and mean-bit-scores, that best separate true hits from false hits, for each domain. Hence, each CDS sequence is represented by a set of domain hits, where each domain hit is described by its bit-score, mean-bit-score, predicted state (positive or negative) and probability of being a positive/negative prediction. The bit-score and mean-bit-score are provided by the matching of a probabilistic model, and the predictive state of the hit together with its probability are provided by a learning step. Then, the set of hits is filtered in several steps: 1. all pairs of overlapping hits that are associated to the same domain and that cover at least 85% of the length of at least one of the two hits, are filtered from the list of CDSs. For such pair of hits, we eliminate the hit that has the lower bit-score. 2. Second, META-CLADE filters out all hits having a bit-score which is smaller than the threshold provided by positive sequences, as determined in the analysis of fragmented segments in read sequences. 3. Based on the parameter estimation obtained with the BayesNaiveBayes method applied to each Pfam domain, the filter accepts only those domain hits that are positive hits and whose probability p for being a positive hit is p > .9. 4. Hits are filtered by best E-value score, associated to each CLADE-centered model and pHMM model. Namely, domain hits are ordered by considering lowest E-value scores and iteratively eliminated hits whether they overlap some domain at least 15% of the length of the hit sequence having best E-value. The output of this filtering pipeline is the contig annotated with non-overlapping domain hits.

Differential expression analysis.
A plugin, using EdgeR libraries 17 is included in Annocript and is used to select transcripts that are differentially expressed between still and turbulent conditions. Transcripts were considered differentially expressed if the false discovery rate (FDR) was smaller or equal to 0.05 and the fold change greater than 2. Another included plugin was used to perform enrichment analysis of GO terms and Pathways exploiting the Fisher exact test and the Benjamini and Hochberg correction of the p-values. GO terms and Pathways were considered enriched when associated to at least 10 differentially expressed transcripts with an adjusted p-value smaller than 0.1.

Supplementary Figures
Supplementary Figure 1

Supplementary Figure 3. iPath representations of DE transcripts a) at time point T2 b) and time point T3.
Green lines indicate upregulation in turbulence-exposed cells compared to still cultures. Red lines indicate down-regulation. The thickness of the lines is proportional to the extent of the change. Legend to the Supplementary Table 1 Supplementary Table 1 contains the output of the differential expression analyses performed at time point T2. The table is arranged as follows: each row is a transcript and columns contain transcript features. TranscriptName: is the name of the transcript as given in the FASTA file logFC: Fold Change in log scale, a positive FC means higher expression in turbulence, a negative FC higher expression in still PValue: P value for significance FDR: False Discovery Rate cpm_T2_B1_STILL, cpm_T2_F2_STILL, cpm_T2_A2_TURB, cpm_T2_E2_TURB: counts per million for the two still and two turbulent samples respectively HSPNameSP: this is the first HSP result (lowest e-value) as given from the BLASTx output against Swiss-Prot HSPEvalueSP: is the corresponding e-value assigned to the HSP DescriptionSP: description of the HSP EnzymeDescs: enzyme descriptions of the EnzymeIDs Pathways: Pathways of first level associated to the transcripts as they are found in UniPathway DescriptionUf: description of the HSP against UniRef HSPEvalueUf: the corresponding e-value assigned to the HSP Taxonomy 18 were also added. Transcripts involved in fatty acid biosynthesis are listed in Supplementary Table 3. Green shades indicate up-regulation, red-shades indicate down-regulation. Darker nuances indicate up-regulation above 1.5 or down-regulation below -1.5 (arbitrary cut-off values). For all the transcripts reported here, except those marked with the symbol + , a tBLASTn search in MMETSP database was performed and the first 50 hits are listed in Supplementary Table 11   The number of broken chains (BC) recorded during cell counts is reported for each TURBOGEN cylinder (turbulent condition A, C, E and still condition B, D, F). These numbers have been weighted over the total number of chains recorded in each sample (Percentage of BC). The overall percentage in turbulent and still conditions (red digits) was calculated as the sum of the BC recorded in each cylinder weighted over the total number of chains. These data were plotted in Fig. 4c (Experiment 1) and Fig. 4d  Legend to the Supplementary Table 11  Supplementary Table 11 contains the first 50 hits from a tblastn search in the MMETSP  database performed with transcripts listed in Supplementary Table 6.

Legend to the Supplementary Table 12
This file contains the annotation for the Chaetoceros decipiens transcriptome obtained from Annocript.
File Supplementary Data 2 contains the nucleotide sequences of the Chaetoceros decipiens transcripts.