Long Range Order and Short Range Disorder in Saccharomyces cerevisiae Biofilm

Biofilm, a colony forming cooperative response of microorganisms under environmental stress, is a major concern for food safety, water safety and drug resistance. Here, we investigated transcriptome-wide expressions of the biofilm yeast Saccharomyces cerevisiae in wildtype, and 6 previously identified biofilm regulating overexpression strains (DIG1, SAN1, TOS8, ROF1, SFL1, HEK2). Using a number of statistical parameters, our overall data reveal a strong transcriptome-wide invariance among all genotypes. This invariance is an indicator of order parameter that keeps the global structure of biofilm stable. Thus, although single mutants may show significant favourable local expression changes (short range disorder), the almost unperturbed global structure indicate gradual adaptive response converging to original stable biofilm states (long range order). Our results ask for a deeper understanding of transcriptome-wide (attractor) behaviour for selecting global regulatory targets for successful control of biofilm formation or progression.


Introduction
Transcriptome-wide expression using RNA-Seq techniques has gained tremendous momentum in the last decade. The successes have been mainly attributed to deeper sequencing and wider dynamic ranges compared to its predecessors such as the microarray [1]. An overwhelming majority of works, however, is investigating a limited subset of genes with higher expression levels or fold changes to infer differential regulation, functional pathways or regulatory networks using common statistical approaches. This is often with the belief that lower expression transcripts are marred with technical/operator induced noise or insignificant fold changing genes are not important. Although this is a necessary concern when we analyse and compare individual gene expressions, on a global transcriptome-wide scale, any arbitrary threshold cut-off may have negative consequences. For example, the popular power-law distribution is only observed when we have a large number of entities within the datasets [2]. Taking a subset of a dataset can lead to the disappearance of the statistical structure [3].
Information theory proposes that the expression level of an entity, such as a gene, carry no real meaning without context, or its probability. Information (or entropy) analysis requires three major characteristics: monotonicity, independence, and branching [4][5][6]. Monotonicity has to do with the universality (context of independence) of a given piece of information. Independence holds when the information gain from two independent experiments is the sum of the information gain from their combination.
Branching breaks a question into parts and bridges them in a tree structure, where following along the path the information gain should be additive [7][8][9]. Thus, analysing high throughput experimental data should not overly focus on the expression values alone (i.e. analysing only highly expressed or significant fold changing data), but look out for the principles used in information theory.
Living cells are complex, dynamical and dissipative systems considered to be in a state that is far from equilibrium [10,11]. In other words, living cells are dynamically exchanging matter for their survival, and are able to evolve spontaneously (say, under any external perturbation) towards a critical point, without fine-tuning system parameters, for a phase transition [12,13]. The formation of biofilm by certain microorganisms, such as the bacteria E. coli and yeast, is a classic example [14,15]. Such phase transformation is known to break the symmetry of the system leaving it invariant, or in a collective mode [16].
Previously, we have used information probability concepts and have shown for diverse mammalian cells, that pair-wise transcriptome-wide expressions, with several orders of magnitude difference between their expression levels, are very high (>0.98) [17][18][19][20][21][22]. The strong invariance, across the large expression variability between genes, is a signature of order parameter organising the entire gene regulatory network imposed by the presence of a common attractor correspondent to the cell type [23][24].
In this paper, to investigate the yeast Saccharomyces cerevisiae biofilm phase transition, we adopted information theory related statistical approaches to analyse the transcriptome-wide RNA-seq expressions in wildtype and 6 overexpressed gene mutants (DIG1, SAN1, TOS8, ROF1, SFL1, HEK2) [25,26]. Using scatter plots, distribution fitting, Pearson correlation, principal components analysis and hierarchical clustering, our data show that any single gene mutant, related to biofilm regulation, despite producing noticeable local expression changes, do not perturb the global collective structure of yeast in any significant manner. This indicates that, over time, the biofilm morphological changes generated by each mutant will be attenuated by the strong global regulatory structure. Thus, biofilm control using single gene target may not be a viable option for successful longterm effects. Our data, therefore, ask for a deeper consideration for the understanding of the global statistical structure of dynamic living systems.

Results
We analysed the whole transcriptome of wildtype and 6 overexpressed gene mutants (DIG1, SAN1, TOS8, ROF1, SFL1, HEK2), with 4 replicates for each condition, of biofilm modulators in the yeast Saccharomyces cerevisiae strain F45, adapted from Cromie et al. [25,26]. These genes are known to regulate the fluffy colony morphology of the yeast strain studied. There was a total of 11,236 unique read counts with non-zero expression in the datasets across all genotypes (Materials & Methods). We next performed normalisation using TPM method [27].

Statistical frequency distributions show lognormal as most probable
It has been shown on several occasions by others, and also by us, that gene expression distributions follow powerlaw or lognormal distributions due to their scale-free network organisation or complex transcriptional regulatory mechanisms [2,[29][30][31]. Hence, we tested the frequency distribution of gene expressions in all genotypes (See Fig. 1 and Fig. S1) using 5 statistical distribution fits: i) log-normal, ii) log-logistic, iii) Pareto (power law), iv) Burr, and v) double Pareto lognormal (DPLN) (Materials & Methods). We observed that, in all datasets, the expression distributions fitted all 5 tested distributions above TPM > 1. The lognormal has a slight advantage for the higher expressed genes (TPM > 5), while the DPLN fitted better at the lower expressed genes (TPM < 5).
This result suggests that the two tails of the gene expression distributions might be due to two different regulations [32]. The upper tail corresponds to higher gene expressions, from transcripts produced through more constitutive gene expression mechanisms, less prone to biological variations or to technical noise. The lower tail corresponds to lower gene expressions, arising from rare transcripts generated from highly stochastic (noisy) transcriptional events, and the measurement of these low expressions is more prone to technical errors [31,33]. It will, therefore, be important to carefully determine a threshold to discriminate between genes that carry important information on the biological response from noisy data. Blue, green, red, orange, and magenta curves indicate lognormal, log-logistic, power-law (Pareto), Burr and double Pareto log-normal fitting (DPLN) respectively. The thick black curve corresponds to the experimental distribution in the wildtype (WT) condition (average of 4 replicates). Parameters of the fitted distributions are summarised in Table  S1.
From the statistical structure, the threshold is at about TPM = 5, and we will scrutinise this threshold on the next section using correlation.

Strong transcriptome-wide invariance between biological replicates and different overexpression conditions
We computed pair-wise Pearson correlation, r, for the whole transcriptome (N = 11,236) between the replicates of each genotype (See Fig. S2). The result shows that the whole transcriptome correlation between each replicate is very close to 1 (mean > 0.987, See Fig. 2A, left panels) for all genotypes. Such strong transcriptome-wide correlation invariance has also been observed for other cell types [17][18][19][20][21][22]. Next, comparing cross-correlations, rc, (correlations between genotypes), although the values are lower, they also remain near unity albeit slightly lower (mean between 0.945 and 0.972, See Fig. 2A, right panels). The data suggest that the global regulation is keeping the transcriptome-wide correlation high despite the local expression changes between the genotypes.
To observe the effect of highly expressed transcripts on the correlation values, r and rc values were evaluated for the top ranked 50 and 500 highest expressed genes (See Fig.  2B). Notably, the r values remained very high (mean between 0.960-0.988). However, the rc values were noticeably lowered. For example, the mean rc between WT and other conditions for the 50 highest genes expressed is 0.859, while it is 0.936 for 500 highest expressed genes and 0.955 transcriptome-wide. Note that removing either 500 highest or lowest expressed genes from the transcriptome had little effect on r and rc values compared with whole transcriptome (See Fig. S3A and B).  Fig. S3C and D). This result confirms that genes with TPM < 5 carry high noise that destroys the correlation structure.
Finally, to test whether sample size impacts correlation values, we tested various sizes for random genes selections (Materials & Methods) for each genotype 100 times and computed their r values (See Fig. 2C). Notably, both r and rc values are significantly higher than the highest expressed genes, and similar to those of whole transcriptome values for all genotypes. Thus, the global or transcriptomewide statistical structure is also revealed by the random selection of genes rather than choosing the highest expressed ones. In other words, random selection of genes shows the fractal nature of transcriptome-wide gene expressions.
Overall, these data show that the highest or lowest expressed genes between different genotypes are differentially regulated causing lowering of their correlation values (See Fig. 2D): highly expressed genes show lower correlation because of their large variance in expression values (larger range of activation). In contrast, the decrease in correlations for lower expressed genes is due to noninformative noise. These data suggest that on a global scale, the technical and biological noise for the gene expressions is not sufficient enough to destroy the statistical structure of all genotypes and replicates investigated for most of the transcriptome (TPM > 5, N = 6,328). It also supports the view that even lowly expressed transcripts (i.e. TPM ~ 5) play an important role in keeping the cells in function [24]. It is, therefore, important to check the statistical structure before arbitrarily discarding a large portion of lowly expressed genes assuming that they carry no information or simply noisy. Hence, based on both statistical structure (distributions) and correlation, we chose to retain 6,328 genes with TPM > 5 for further global analysis.

Principal Component Analysis and statistical
clustering reveal similarity/diversity between overexpressed conditions We evaluated the principal components (PC) for different sample sizes of sorted genes from the highest to lowest expression values (N = 10, 30, 50, 100, 150, 200, 300, 500 and 6,328, i.e. TPM>5, See Fig. 3A-C). PCs 1 to 3 constituted more than 76% variance (in whole informative transcriptome, N = 6,328, See Fig. S4A), and we plotted them for each genotype for the different N sizes (See Fig. S4B). Notably, the PC loading scores (position) of each genotype became fixed when N ≥ 100 (See Fig. S4C). That is, a minimum sample size of 100 highest expressed genes is sufficient for the convergence of PCs with the whole transcriptome data. If N < 100, the PC values became more variable and overlapped with other genotypes. Looking closer at PCA loadings (See Fig. 3A, left), we observe PC1 is able to discriminate replicates in the TOS8 overexpression condition with no overlap with other conditions, while extreme loading scores for PC2 are obtained for the replicates of the wildtype condition, and PC3 extreme scores are for the ROF1 condition. Other PC vectors, in contrast, show low specificity with high overlap between replicates of different conditions. When plotting PC1 vs. PC2 and PC3, we also observe clustering and seclusion of the wildtype replicates, as well for TOS8 replicates (See Fig. 3A, middle and right). Replicates of other conditions also tend to cluster together, however, with more overlap with other conditions. This is also confirmed by multidimensional scaling (MDS) over Euclidean distances between replicates and conditions (See Fig. 3B), which shows that the largest distances between conditions are between wildtype, TOS8 and SFL1. Together, PCA and MDS indicate that wildtype and TOS8 conditions have the most distinct variance (response) while other conditions are closer in terms of transcriptome-wide response. Hierarchical clustering of the expression matrix with (TPM > 5, See Fig. 3C), based on Pearson correlation as distance metric, corroborates these results, and indicated 4 major clusters: i) all 4 wildtype replicates, ii) TOS8, iii) SFL1/DIG1 and iv) ROF1/HEK2/SAN1. This result is also reflected from the rc values (see See Fig. 2A, right panel, mean values of rc for wildtype and TOS are the lowest).

Characterisation of differentially and commonly expressed genes between overexpressed conditions
It is noteworthy to identify differentially expressed genes between wildtype and the overexpressed conditions. One way to identify differentially expressed genes is through standardisation of gene expression values (Z-scores) [21]. In brief, the expression value (xij) of each gene (i) in each experiment (replicate or condition, j) is normalised to the mean expression values across all experiments (µi) and scaled by the standard deviation of expression values across all experiments (σi), that is: Zij = (xijµi)/ σi. We performed a log10 transformation of the data prior to normalisation to minimise outlier effects, and averaged the Z-scores of all replicates (28 samples x 6,328 genes) for each condition, to obtain an averaged normalised matrix (7 conditions x 6,328 genes).
In order to interpret Z-values, it is important to verify the normality of their distribution (See Fig. S5A): performing a QQ-plot showed most of the distribution approximates the standard normal distribution (mean = 0, s.d. = 1), except for |Z| > 2, indicating significant outliers (corresponding p < 0.05 assuming normal distribution, See   Fig. S5D). Hence, we identified differentially expressed genes with TPM > 5 and |Z| > 2 in at least one experiment, and found, out of the 6,328 genes, only 296 genes, corresponded to the local transcriptome response. This data, therefore, suggest that only a small proportion of the transcriptome is directly regulated by the overexpressed genes, and a large majority are only marginally affected.

Characterisation of local and global transcriptome responses in the different overexpressed conditions
To characterise the genes that are dissimilar between wildtype and other conditions and their functions, we first performed hierarchical clustering on the Z-score matrix of the 296 differentially expressed genes (TPM > 5 and |Z| > 2). We obtained 32 super-clusters (See Fig. S6A) that were then iteratively merged according to their similarity by minimising the correlation between clusters and maximising correlation within clusters. After the procedure (See Fig. S6B and C), we obtained 11 distinct clusters (See Fig. 4A), of which, 7 consist of upregulated genes in each condition, and 4 consist of down-regulated genes in wildtype, DIG1, SFL1 and TOS8 conditions. Note that down-regulated genes in wildtype indicate up-regulated genes in all overexpression experiments (and vice-versa).
The two largest clusters correspond to genes up-(n = 69 genes, cluster L4, See Fig. 4A and C) and downregulated (n = 128, cluster L1) in wildtype compared to all other conditions. That is, over the 296 differentially expressed genes, 197 were common between all overexpression conditions (67%), indicating a core set of genes that defines a non-specific response. Performing gene ontology (GO) using AMIGO2 [34] showed the up-regulated genes of the non-specific local response are involved in various mitochondrial processes, in particular translation, while the down-regulated genes play a role in the nucleotide and amine metabolic processes (Table S2). On top of this, TOS8 specifically regulated 80 genes (21 up-and 59 downregulated, clusters L2 and L5, 27%). Notably, genes involved in the fungal-type vacuole membrane were down-regulated by TOS8. In other conditions, very few other genes were specifically differentially expressed.   Table S2 and S3).
Other overexpressed conditions also regulated a small number of genes: DIG1 (alone, cluster L11) downregulated 6 genes (cluster L3) and SFL1 regulated 5 other genes (1 up, 4 down, clusters L8 and L9). HEK2 upregulated another gene (cluster L10), and so did SAN1 (cluster L6). ROF1 did not show any specific local response (cluster L7 only contained ROF1 gene). For these clusters, because of the low number of genes, GO could not be performed.
These results indicate that the local response is very limited to only a small number of genes, that, for a large majority, belong to very generic and aspecific biological processes, e.g. to support the transcriptional and translational response to an external or internal stress. Notably, we have previously shown that genes with lower expression changes can also play an important role in shaping global transcriptional responses [18,24]. It is, therefore, interesting to investigate the interplay between the local and global responses in the different overexpression conditions. We, thus, expanded the pool of differentially expressed genes by lowering the Z-values threshold to include genes with 1.5 < |Z| ≤ 2, which corresponded to 2,486 additional genes. We again performed hierarchical clustering of the Z-values resulting in 48-super clusters (See Fig. S6D), and finally obtained 6 distinct expression profiles (See Fig.  4B and Fig. S5E and F). Similarly, to the local response, we also observe a large group of genes that are commonly up-or down regulated in all conditions (1260 genes, clusters G2 and G4, See Fig. 4B and C). These genes belong to diverse biological processes involved in the global transcriptome aspecific response (transcription, translation, cell cycle, response to stress, etc., Table S3).
We also observe a number of clusters that are specific to certain overexpressed conditions. However, in contrast to the local response, where we found only onecondition-specific clusters, we see a more intertwined regulation of the clusters between the different conditions (See Fig. 4C). For example, 779 genes that are up-regulated in TOS8 condition (cluster G1) and involved in many metabolic processes, are also weakly down-regulated by HEK2 (See Fig. 4B and 4C). Conversely, 266 genes (cluster G6) enriched in processes such as protein folding are downregulated by TOS8 but upregulated by HEK2 (see Table S3). A similar pattern occurs between DIG1 and ROF1 conditions for 2 smaller clusters (99 genes enriched in fungal cell wall, and 82 genes in RNA modifications, clusters G3 and G5). We also observe that SFL1 does not only down-regulate genes in the global response but also weakly up-regulate genes that are up-regulated by TOS8. (cluster G1). Finally, the specific global response seems absent when overexpressing ROF1.
Overall, the analysis indicates that, on top of a very limited local response (296 genes, of which 197 are nonspecific to any overexpression condition), there exist a weak but co-regulated global transcriptome response that encompasses a large portion of the transcriptome (2,486 genes, 1260 non-specific) (See Fig. 4C). Most of the differentially expressed genes, either from the local or global response, belong to generic biological processes, such as transcriptional, translational, or metabolic processes, that are collectively activated or repressed upon external stress to adapt to diverse environmental changes (Table S2 and S3). Thus, despite each overexpressed condition showing some specific and strong local expression changes, the global transcriptome structure remains almost unaffected and gradually adapts in order to converge to a phenotypic stable state.

Discussion
Single cell microorganisms are free-living system often yielding uncorrelated (disordered) responses [31,33]. However, when they are exposed to environmental or pathogenic threats, a large number of species are able to aggregate, often within a self-secreted extracellular polymeric substances matrix or through quorum sensing, resulting in a highly ordered structure called biofilm [14].
In this paper, we investigated the global properties of the biofilm formed by the Saccharomyces cerevisiae. Numerous studies have investigated a number of mutants that regulate the growth of the biofilm, however, the conclusions do not point to a clear winner [15,30]. Here, we adopted a number of statistical metrics to analyse the transcriptomewide RNA-Seq expressions in wildtype and 6 overexpressed gene conditions (DIG1, SAN1, TOS8, ROF1, SFL1, HEK2) of the yeast strain F45.
After accounting for technical noise, we observed log-normal gene distribution (See Fig. 1). Using this statistical structure for expression threshold cut-off, we analysed a total of 6,328 gene transcripts. Pearson correlation reveals strong invariance not only between the same genotype replicates, but also between all genotypes (See Fig. 2). The strong invariance was also recapitulated by random selection

ReView by River Valley Technologies
Engineering Biology 2018/08/06 04:52:17 IET Review Copy Only of genes. However, when only the highly or lowly expressed genes were compared between genotypes, the correlations were noticeably lower. These data suggest that transcriptomewide response, across the expression spectrum, is scale-free and display long-range order observed in other complex physical systems [2,10,11,13]. Principal component and clustering (See Fig. 3) showed that TOS8 was the most distinct response from wildtype and any other overexpressed condition. Other conditions were closer to each other in terms of response, while still being quite distinct from wildtype. These simple approaches, thus, reveal the similarity and diversity between the different overexpression and wildtype conditions.
Focusing only on a selected highly expressed genes show short-range disorder. It also revealed that a majority of genes (67% of the local response, and 51% of the global response) are commonly regulated between all conditions and GO analysis reveals the distinct functions of the differentially regulated genes (See Fig. 4).
Overall, our analysis provided statistical evidence that biofilms are highly structured transcriptome-wide (not only in their physical appearance), and single mutant/overexpression regulation, such as the ones investigated here, are only able to propagate response locally. Hence, these mutants will likely provide only a transient success in biofilm control and may be subdued in the long run.
These data coincide with the notion of attractor states in biology [24,33,35], where there is a stable equilibrium state towards which a dynamical system tends to converge despite perturbation. Moving out of the attractor state requires the collective action of the entire system parameters, in this case, gene expressions [24,36]. Thus, we urge the microbiologist communities to tackle biofilm challenge adopting multidisciplinary approaches that will shed light not only on the local effects but also on the global regulation, and to identify novel targets for generating longrange disorder.

Statistical Distributions Fitting
Fitting of the distributions of experimental gene expression data was performed using the Maximumlikelihood Fitting method (fitdistr of the R MASS package [37] for Pareto, log-normal, log-logistic and Burr distributions. The double Pareto log-normal distribution fitting was performed using a general simulate-annealing approach (GenSA package [38]) to reduce error between experimental PDF curve and the simulated one.
The Pareto (power law) distribution [39] is characterised by 2 parameters, the exponent, α that determines the slope of the upper tail of the distribution, and a threshold xm below which the power law does not apply. The probability density function is defined as, for x > xm: The log-normal distribution is defined for random variable whose logarithm is normally distributed. It is characterised by 2 parameters, μ and σ, respectively the (log) mean and standard deviation of the underlying normal distribution. Its probability density function (PDF) is defined as: The log-logistic distribution [39] is defined for random variable whose logarithm has a logistic distribution. It is characterised by 2 parameters, α and β, respectively the scale and shape the underlying logistic distribution. Its PDF is defined as: The Burr distribution [39] is one of the generalised log-logistic distributions and is characterised by 3 parameters, α, β, and s, respectively the 2 shape parameters and the scale parameter. The PDF is defined as: The double Pareto log-normal distribution [40] arises out of a mixture of lognormal distributions shows Pareto power-law behaviour in both tails. It is characterised by 4 parameters, α, β, τ and ν, which stand for the shape parameters (exponents) of the 2 underlying Pareto laws, and the mean and standard deviation of the underlying log-normal distribution. The PDF is defined as: where Φ and Φ c are the cumulative density function (CDF) and complementary CDF of the normal distribution, N(0, 1).

Pearson correlation
The Pearson correlation coefficient r between two vectors (e.g. transcriptome in two different samples), containing n observations (e.g. gene expression values), is obtained by (for large

Hierarchical Clustering
Hierarchical clustering builds a hierarchy of clusters using two methods: agglomerative and divisive algorithm. We used the former (Ward's) where each observation starts in its own cluster, and pairs of clusters are merged moving up the hierarchy. The Ward's method [41] starts with n clusters of size 1 and continues until all the observations are included into one cluster. It begins with the "leaves", looks for groups of leaves to form "branches" and work its way to reach the "trunk".
Here, the n differentially expressed genes (n=296 for local response, n=2486 for global response) were used to form clusters using the normalised gene expression standard scores, where xi,t is the expression value (TPM) of the i th gene at sample j, ̅ is the mean value in all samples and is the standard deviation.
In the first step, K clusters are formed (we chose K=32 and K=48 from the initial n=296 and n=2486 genes respectively) by cutting the tree (cutree function in R). The next step involves re-clustering the initial K clusters into K-1 clusters, so the two closest clusters (in terms of distance) are merged. The procedure is repeated until K=1. At each step, we measure the within-cluster correlations (the average correlation among genes that form each cluster), as well as between-cluster correlations (between average profiles of each pair of clusters). We chose the optimal number of clusters, K, such that the between-cluster correlations are low, and the within-cluster correlations are high. If K is too big, many clusters will have similar profiles between samples (with high between-cluster correlation). Conversely, if K is too small, the algorithm will start to merge clusters with heterogeneous gene profiles, and within-clusters correlations will drop. Tables  Table S1. Parameters of the distribution fitting.    Table S1.       ReView by River Valley Technologies Engineering Biology