Synthetic Core Promoters as Universal Parts for Fine-Tuning Expression in Different Yeast Species

Synthetic biology and metabolic engineering experiments frequently require the fine-tuning of gene expression to balance and optimize protein levels of regulators or metabolic enzymes. A key concept of synthetic biology is the development of modular parts that can be used in different contexts. Here, we have applied a computational multifactor design approach to generate de novo synthetic core promoters and 5′ untranslated regions (UTRs) for yeast cells. In contrast to upstream cis-regulatory modules (CRMs), core promoters are typically not subject to specific regulation, making them ideal engineering targets for gene expression fine-tuning. 112 synthetic core promoter sequences were designed on the basis of the sequence/function relationship of natural core promoters, nucleosome occupancy and the presence of short motifs. The synthetic core promoters were fused to the Pichia pastoris AOX1 CRM, and the resulting activity spanned more than a 200-fold range (0.3% to 70.6% of the wild type AOX1 level). The top-ten synthetic core promoters with highest activity were fused to six additional CRMs (three in P. pastoris and three in Saccharomyces cerevisiae). Inducible CRM constructs showed significantly higher activity than constitutive CRMs, reaching up to 176% of natural core promoters. Comparing the activity of the same synthetic core promoters fused to different CRMs revealed high correlations only for CRMs within the same organism. These data suggest that modularity is maintained to some extent but only within the same organism. Due to the conserved role of eukaryotic core promoters, this rational design concept may be transferred to other organisms as a generic engineering tool.

S1 - Summary of literature references on S. cerevisiae and P. pastoris core promoters.
In this section we will succinctly describe the main studies developed to clarify the mechanisms of yeast promoters, focusing on S. cerevisiae and P. pastoris. The references of S1 and S2 follow the numbering of the main text.
Park et al. developed a method for improved transcription start site (TSS) mapping, inferring relationships between core promoter cis-regulatory elements, chromatin features and TSS location 33.
Dvir et al. presented a study on the effect of 5'UTR features on gene expression 34. Yeast core promoters show a high level of conservation, maintaining their functionality even in distantly related species 27. As an illustrative example, the S. cerevisiae LEU2 core promoter has been shown to remain functional when inserted in P. pastoris 8. Different approaches have been followed to design synthetic promoters for fine-tuning protein expression. Some focus on CRMs 37,38, while others target both CRMs and core promoters 7,9,18,19,39-42. For CRM design, the interaction between transcription factors (TFs) and their respective binding sites (TFBSs) has been used to model transcription, either by creating a large library of promoters based on the combinatorial arrangement of different TFBSs upstream of the natural core promoter 38 or by generating orthogonal synthetic zinc fingers used to wire new synthetic transcriptional cascades 37.
The inclusion of regulatory sequences next to the core promoter can also be used to fine-tune transcription. Other design approaches consisted of adding random mutations to a natural promoter (TEF) 7,9, randomizing two specific areas of the PFY1 promoter and subsequently changing its expression profile by adding Tn10 Tet operator sites 41, or generating a minimal promoter based on a large-scale screening of random sequences for minimal length, robustness and modularity 19.

P. pastoris
Promoter libraries have been generated mostly by deleting sections of CRMs in order to investigate the PAOX1 regulatory mechanisms and to determine TFBSs (e.g. 24,26). Promoter libraries have also been created to control gene expression by modifying the core promoter 16,25,43 and 5'UTR 44 sequences.
Berg et al. studied random mutagenesis of PAOX1, with mutations located in both the core promoter and CRM regions 16.
They observed modifications of the expression profile (derepression) when mutating specific nucleotides in the PAOX1 CRM, and modifications of the expression rate when mutating the core promoter sequence. Following a different approach, P. pastoris synthetic core promoters have also been designed from the consensus sequence of four natural P. pastoris core promoters, to which some natural TFBSs were added 25.

S2 - Detailed computational design of synthetic core promoters
Because the core promoter design method was based on features from a genome-wide list of S. cerevisiae natural core promoter sequences 28, it can be divided into three parts: a) Computation of features from the S. cerevisiae data set to be used for the synthetic core promoter design; b) Generation of core promoter sequences based on the calculated data; c) Design space reduction - selection of core promoter sequences to be tested in vivo.
As mentioned in the main text, several features were simultaneously incorporated in the design because they were found to be correlated with maximal promoter activity 28. The features included in the design process were: i) nucleotide occurrence along the sequence of 140 strong natural S. cerevisiae core promoters (as reported by 28), ii) the presence and position of the TATA box, iii) the position and number of other motifs (other than the TATA box, as defined by 28) and iv) nucleosome occupancy profiles 28,45.
As described below, all of this information was calculated in the first design part (a). However, the nucleotide occurrence, TATA box position, and motif position and frequency were included in the core promoter design (b), while the nucleosome occupancy profile was included indirectly in the design process as a selection step (c).
It should be highlighted that only some of the motifs described by Lubliner et al. were added in this design process. The selection criteria were: strong reported correlation with maximal promoter activity and motif position within the desired core promoter region (from start codon to 150 bp upstream of it given that our core promoters' target length was 150 bp). The list of selected motifs and respective location is provided in Supplementary Tables 7-10.

a) Computation of features from the S. cerevisiae data set to be used for the core promoter design
Firstly, from the whole data set of 729 native S. cerevisiae promoters, we focused on the 140 strong core promoters and their respective 5'UTRs. The sequences were trimmed to a final length of 150 bp (corresponding to 50 bp downstream and 100 bp upstream of the transcription start site (TSS)). From this subset we computed the following:
1. Nucleotide probability distribution along the core promoter sequence - The frequency of each nucleotide was computed separately for consecutive promoter regions in a sliding-window manner (window size of 20 bp and window step of 10 bp). The probability was calculated for each nucleotide and promoter region (the frequency of each nucleotide in each promoter region was divided by the window size). This resulted in an n x w matrix, with n the number of nucleotides (4) and w the number of windows (14). The probabilities in each column summed to 1.
2. TATA box position distribution along the sequence - Considering the TATA box consensus sequence (TATAWAWR), the locations of all occurrences of this motif were annotated. A Gaussian distribution model was inferred from this set of TATA box locations using the respective average ($\mu_T$) and standard deviation ($\sigma_T$).
3. For each of the selected motifs listed in Supplementary Tables 7-10, an approach similar to the previous step was used: the number and positions of motif occurrences were annotated ($f_{Mi}$ and $p_{Mi}$, respectively, with $i = 1, 2, \ldots, 7$). For each set of frequencies and positions a Gaussian distribution model was inferred, described by the respective average ($\mu_{f_{Mi}}$ and $\mu_{p_{Mi}}$) and standard deviation ($\sigma_{f_{Mi}}$ and $\sigma_{p_{Mi}}$).
4. Average nucleosome occupancy along the promoter sequence - The last pre-design computation was the average nucleosome occupancy profile of the natural promoters. For this step a software package by 45 was used to calculate a nucleosome profile for each of the 140 natural core promoters. To avoid errors related to sequence edges, a 1000 bp sequence (derived from the original cloning plasmid) was added to each side of the promoter sequences. The average nucleosome profile ($\mu_{Nj}$) was calculated from the 140 occupancy profiles obtained (each profile consisted of 150 occupancy scores, one per nucleotide position, $j = 1, 2, \ldots, 150$).
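Steps 1 and 2 of this part can be illustrated with a minimal Python sketch. It is not the original implementation (the source only names MATLAB functions for the later generation step), and several details are assumptions: each window is pooled across all promoters before normalising, TATA box positions are counted 0-based from the start of the 150 bp sequence, the sample standard deviation is used, and the two toy promoters are invented for illustration.

```python
import re
from statistics import mean, stdev

NUCLEOTIDES = "ACGT"
WINDOW, STEP, LENGTH = 20, 10, 150
# TATAWAWR in IUPAC codes: W = A or T, R = A or G.
# The lookahead lets overlapping occurrences be reported too.
TATA_REGEX = re.compile(r"(?=TATA[AT]A[AT][AG])")


def window_probabilities(seqs):
    """Return a 4 x 14 matrix (rows A, C, G, T; one column per window)."""
    starts = range(0, LENGTH - WINDOW + 1, STEP)
    matrix = [[0.0] * len(starts) for _ in NUCLEOTIDES]
    for w, start in enumerate(starts):
        # Pool the same window from every promoter, then normalise (assumption).
        pooled = "".join(s[start:start + WINDOW] for s in seqs)
        for n, nt in enumerate(NUCLEOTIDES):
            matrix[n][w] = pooled.count(nt) / len(pooled)
    return matrix


def tata_position_model(seqs):
    """Return (mu_T, sigma_T) over all TATA box start positions (0-based)."""
    positions = [m.start() for s in seqs for m in TATA_REGEX.finditer(s)]
    return mean(positions), stdev(positions)


# Hypothetical toy promoters of 150 bp each (not real sequences).
toy = [
    "G" * 40 + "TATAAATA" + "C" * 102,
    "C" * 60 + "TATATAAG" + "G" * 82,
]
probs = window_probabilities(toy)
mu_T, sigma_T = tata_position_model(toy)
```

With a 20 bp window and a 10 bp step over 150 bp, the window starts are 0, 10, ..., 130, which gives the 14 columns mentioned in step 1. The same scan-and-fit pattern would apply to the selected motifs of step 3, using their own consensus sequences.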

b) Generation of core promoter sequences based on calculated data
As mentioned in the main text, 4 different groups (named P, T, M, A) were designed using the previously calculated information. They differ in the presence or absence of a TATA box and/or selected motifs (group P: without TATA box or motifs; group T: with TATA box and without motifs; group M: with motifs and without TATA box; group A: with TATA box and motifs).
The sequence generation was computed as follows:
1. Random sequence generation - 400 sequences of 150 bp each were generated with the MATLAB function randseq. This function took as input the vector of nucleotide probabilities $w_l$ ($l = 1, 2, \ldots, 14$) and the sequence length (equal to the window size, 20). Thus, the randseq function was used 14 times to generate each sequence.
2. Removal of randomly occurring motifs (TATA box and selected motifs) - TATA boxes and any of the selected motifs were searched for and replaced by a newly generated sequence. This procedure was repeated until no motif or TATA box was found in the generated sequences.
3. Removal of randomly occurring start codons - Following the approach of the previous step, start codons upstream of the protein coding region were also removed to avoid frame shift mutations or different N-termini of the reporter protein.
4. Addition of a Kozak sequence upstream of the start codon - Due to the known relevance of the nucleotides adjacent to the start codon 34, this region was replaced by the PAOX1 Kozak sequence (CGAAACG) in the generated sequences.
5. Separation of sequences into 4 groups - The 400 sequences were divided into 4 groups of 100 sequences each. Group P had no further modifications, as it is characterized by having neither a TATA box nor any other motifs.
6. Addition of a TATA box to groups T and A - For each sequence belonging to these groups, a TATA box position was generated (randn MATLAB function). The TATA box Gaussian distribution was taken into account by multiplying the generated number by $\sigma_T$ and adding $\mu_T$. One TATA box was inserted per core promoter sequence. The sequence originally located in this region was replaced by the TATA box. The sequences in group T had no further modifications, as this group is characterized by having a TATA box and no other motifs.
7. Addition of motifs to groups M and A - As in the previous step, the number of motifs and their respective positions were generated with the randn MATLAB function together with the respective average ($\mu_{f_{Mi}}$ and $\mu_{p_{Mi}}$) and standard deviation ($\sigma_{f_{Mi}}$ and $\sigma_{p_{Mi}}$). Thus, the frequency of each motif in each sequence also followed a Gaussian distribution model inferred from the natural sequences, meaning that some motifs might be present more than once while others might be absent in a given sequence.
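The generation of a group P and a group T sequence (steps 1-4 and 6) can be sketched in Python as follows. This is an illustration under stated assumptions, not the original MATLAB code: `random.choices` and `random.gauss` stand in for randseq and randn; each position is drawn from the window starting at its own 10 bp block, because the text does not specify how the 14 overlapping windows were combined into 150 bp; the Kozak sequence is assumed to occupy the last 7 bases; one concrete instance of the TATAWAWR consensus is used as the inserted TATA box; and the uniform probability matrix and the values of `mu_t` and `sigma_t` are hypothetical placeholders.

```python
import random
import re

NUCLEOTIDES = "ACGT"
LENGTH, STEP = 150, 10
KOZAK = "CGAAACG"        # PAOX1 Kozak sequence quoted in the text
TATA_BOX = "TATAAATA"    # one instance of TATAWAWR (assumed choice)
TATA_REGEX = re.compile(r"TATA[AT]A[AT][AG]")
START_CODON = re.compile(r"ATG")


def draw(window_probs, rng, start=0, end=LENGTH):
    """Draw bases for positions start..end-1 from their window's probabilities."""
    last = len(window_probs[0]) - 1
    bases = []
    for j in range(start, end):
        w = min(j // STEP, last)  # assumed position-to-window mapping
        weights = [row[w] for row in window_probs]
        bases.append(rng.choices(NUCLEOTIDES, weights=weights)[0])
    return "".join(bases)


def scrub(seq, patterns, window_probs, rng, max_rounds=10_000):
    """Redraw the first matching span until no pattern is left (steps 2-3)."""
    for _ in range(max_rounds):
        hit = next(filter(None, (p.search(seq) for p in patterns)), None)
        if hit is None:
            return seq
        fresh = draw(window_probs, rng, hit.start(), hit.end())
        seq = seq[:hit.start()] + fresh + seq[hit.end():]
    raise RuntimeError("patterns could not be removed")


def add_kozak(seq):
    """Overwrite the last 7 bases, assumed to sit directly upstream of ATG."""
    return seq[:-len(KOZAK)] + KOZAK


def add_tata(seq, mu_t, sigma_t, rng):
    """Overwrite 8 bases with a TATA box at a Gaussian-distributed position."""
    pos = round(rng.gauss(mu_t, sigma_t))
    # Clamp so the box stays inside the sequence and leaves the Kozak intact.
    pos = max(0, min(pos, LENGTH - len(KOZAK) - len(TATA_BOX)))
    return seq[:pos] + TATA_BOX + seq[pos + len(TATA_BOX):], pos


rng = random.Random(0)
uniform = [[0.25] * 14 for _ in NUCLEOTIDES]  # placeholder for the part (a) matrix
raw = draw(uniform, rng)
group_p = add_kozak(scrub(raw, [TATA_REGEX, START_CODON], uniform, rng))
group_t, tata_pos = add_tata(group_p, mu_t=50.0, sigma_t=15.0, rng=rng)
```

Motif insertion for groups M and A (step 7) would follow the same overwrite pattern, with the number of copies and their positions drawn from the motif-specific Gaussian models. Note that, as in the described procedure, inserting a TATA box after the start codon removal can in principle create a new ATG at the junction; the sketch does not re-check for this.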

c) Design space reduction - selection of core promoter sequences to be tested in vivo
From the 100 sequences in each group, 28 were selected for experimental screening. For each of the 100 designed sequences a nucleosome occupancy profile was calculated (as described in a-4). Using the calculated profiles, the objective function used to select the 28 sequences was:

$$\mathrm{SSE}_s = \sum_{j=1}^{150} \left(\mu_{Nj} - \mu_{sj}\right)^2$$

where $\mu_{Nj}$ is the average nucleosome occupancy profile of the natural core promoter sequences and $\mu_{sj}$ is the nucleosome occupancy profile of synthetic core promoter sequence $s$ ($s = 1, 2, \ldots, 100$) at nucleotide position $j$. The 28 sequences with the lowest sum of squared errors were selected. In this way we aimed to select for screening the designed sequences that were most similar to the natural promoters with respect to the predicted average nucleosome occupancy.
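The selection step amounts to ranking the synthetic sequences by the sum of squared differences between their predicted occupancy profile and the natural average, then keeping the 28 with the smallest value. A minimal Python sketch is given below; the function names are hypothetical, the flat toy profiles are invented for illustration, and the nucleosome occupancy prediction itself (performed with the software package of 45) is treated as a given input.

```python
def average_profile(profiles):
    """Position-wise mean of equally long occupancy profiles."""
    return [sum(column) / len(column) for column in zip(*profiles)]


def sse(natural_mean, synthetic):
    """Sum of squared errors between two profiles of the same length."""
    return sum((n - s) ** 2 for n, s in zip(natural_mean, synthetic))


def select_closest(natural_mean, synthetic_profiles, k=28):
    """Indices of the k synthetic profiles with the lowest SSE."""
    order = sorted(
        range(len(synthetic_profiles)),
        key=lambda i: sse(natural_mean, synthetic_profiles[i]),
    )
    return order[:k]


# Hypothetical flat profiles: two "natural" ones and 100 "synthetic" ones
# whose distance from the natural mean grows with a scrambled offset.
natural_mean = average_profile([[0.4] * 150, [0.6] * 150])
offsets = [(i * 37) % 100 for i in range(100)]
synthetic = [[0.5 + 0.004 * off] * 150 for off in offsets]
chosen = select_closest(natural_mean, synthetic)
```

In this toy case the 28 chosen indices are exactly those whose offsets are 0 to 27, that is, the profiles lying closest to the natural average.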

The histograms show that the cell populations are highly similar when comparing the synthetic core promoters to each other and to the natural AOX1 core promoter. Notably, all strains measured showed two separate fluorescence histogram peaks, indicating distinct cell populations. These populations may be caused by the methanol-inducible nature of our system: cells are first grown on glucose and then induced with methanol. The different cell populations may be attributable to 'older' cells, which were grown on glucose and subsequently induced, and 'new' cells, which emerged from cell divisions after methanol induction and hence were grown only under these conditions. These two peaks occurred for all core promoters tested (the synthetic ones and the native control), so these differences appear to be an inherent trait of the methanol-induced yeast cells. The flow cytometry data were also used to calculate noise levels. For this purpose a squared coefficient of variation was calculated from the eight biological replicates of each strain. No clear difference was found between the synthetic core promoters and the wild type core promoter (data not shown).
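The noise measure mentioned above can be written as the squared ratio of standard deviation to mean. The following Python sketch assumes the sample standard deviation and one summary fluorescence value per biological replicate; neither detail is stated in the text, and the replicate values are hypothetical.

```python
from statistics import mean, stdev


def squared_cv(values):
    """Squared coefficient of variation: (standard deviation / mean) ** 2."""
    return (stdev(values) / mean(values)) ** 2


# Hypothetical fluorescence values for eight biological replicates of one strain.
replicates = [1020.0, 980.0, 1005.0, 995.0, 1010.0, 990.0, 1000.0, 1000.0]
noise = squared_cv(replicates)
```

Because the measure is normalised by the mean, it allows strains with very different expression levels to be compared on the same scale.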