yGPS-P: A Yeast-Based Peptidome Screen for Studying Quality Control-Associated Proteolysis

Quality control-associated proteolysis (QCAP) is a fundamental mechanism that maintains cellular homeostasis by eliminating improperly folded proteins. In QCAP, the exposure of normally hidden cis-acting protein sequences, termed degrons, triggers misfolded protein ubiquitination, resulting in their elimination by the proteasome. To identify the landscape of QCAP degrons and learn about their unique features we have developed an unbiased screening method in yeast, termed yGPS-P, which facilitates the determination of thousands of proteome-derived sequences that enhance proteolysis. Here we describe the fundamental features of the yGPS-P method and provide a detailed protocol for its use as a tool for degron search. This includes the cloning of a synthetic peptidome library in a fluorescence-based reporter system, and data acquisition, which entails the combination of Fluorescence-Activated Cell Sorting (FACS) and Next-Generation Sequencing (NGS). We also provide guidelines for data extraction and analysis and for the application of a machine-learning algorithm that established the evolutionarily conserved amino acid preferences and secondary structure propensities of QCAP degrons.


Introduction
Protein misfolding and aggregation are fundamental problems that every cell faces. Consequently, cells employ various co-and post-translational protein quality control (PQC) mechanisms to mitigate this problem. These include molecular chaperones that can repair misfolded proteins and return them to a functional state and proteolytic systems that recognize misfolded proteins and eliminate them through degradation. PQC pathways have evolved to be extraordinarily robust because they must maintain protein homeostasis (proteostasis) indefinitely, despite stochastic external and internal variations that challenge protein folding. When PQC pathways demonstrate low performance or when they are otherwise overwhelmed, misfolded proteins accumulate and often form insoluble aggregates that can lead to cellular dysfunction [1]. The deleterious consequences of a cell's failure to cope with misfolded proteins are apparent in all living organisms, from bacteria to humans [2]. The importance of this failure is best highlighted by the >50 devastating human diseases causally linked to protein aggregation [3].
One strategic way eukaryotic cells cope with misfolded proteins burden is by destroying them through Quality Control-Associated Proteolysis (QCAP). For cellular QCAP to function efficiently, cells must have the capacity to destroy misfolded proteins quickly and at the same time ignore normally folded proteins [2]. QCAP is typically executed by the ubiquitin-proteasome system (UPS), which involves specific ubiquitin ligases that recognize and mediate the ubiquitination of misfolded proteins for subsequent destruction by the 26S proteasome [4]. However, the UPS is also involved in the degradation of normally folded proteins for physiological control [5]. This means that, for the UPS to respond correctly to varied physiological conditions, it requires diverse and sometimes conflicting features from its substrates [6]. On the one hand, there is a need for stringent sequence specificity to degrade proteins involved in controlled cellular processes (e.g., transcription, signaling, cell-cycle progression, etc.) with precise temporal and spatial accuracy. On the other hand, any protein can misfold and broad flexibility in the recognition process is essential for rapidly and efficiently targeting proteins undergoing non-native conformational changes for degradation without consequences to normal proteins. Thus, high specificity is still required for QCAP to avert the destruction of normal and fully functional proteins.
A key question is how QCAP pathways handle such a diverse array of misfolded substrates without sacrificing normal proteins. The main hypothesis in the PQC field is that QCAP systems recognize the biophysical phenomenon of exposed hydrophobicity [7]. This is based on the biochemical property that soluble folded proteins place hydrophobic residues in their interior to form a hydrophobic core, away from the aqueous solution. Thus, it is not surprising that the exposure of hydrophobic residues is a key feature of QCAP degrons [8][9][10][11][12]. However, "exposed hydrophobicity" is an exceptionally broad term that can encompass many distinct features that might be overlooked if using this superficial term as a "catch-all" for misfolded protein recognition. In fact, it is our working hypothesis that not all exposed hydrophobicity is the same and that the majority of the eukaryotic proteome contains cryptic QCAP degrons that are exposed upon stress induction.
The main obstacle in identifying QCAP degrons, let alone their mechanism of function, is that they cannot be deduced from basic features of already known degrons, due to their sequence and structural heterogeneity. To overcome these challenges, we have recently implemented the Global Protein Stability peptidome (GPS-P) technology in yeast, termed yGPS-P, to identify QCAP degrons and their cognate degradation machinery [13,14]. In yGPS-P, a synthetic peptide library that emerged from the yeast ORFome is fused Cterminally to a GFP reporter, while a Cherry fluorescent protein precedes the GFP on the same mRNA transcript. While both Cherry and GFP are transcribed to a single mRNA, they are translated separately. Hence, alterations in the GFP/Cherry ratio upon peptide appendage indicate an increased elimination rate of the GFP. We used the yGPS-P system to identify a large cohort of internal degrons and thus to decipher rules governing the stability of the eukaryotic proteome. Here, we provide a detailed protocol of this system from the library design to the revelation of principal features of internal yeast QCAP degrons, including the preferred amino acid and secondary structure properties.

Experimental Design
Below described are our considerations for the establishment of the yGPS-P method, which includes three primary steps: (1) Generation of a peptidome library; (2) Data collection (3) Data analysis: (1) Generation of a peptidome library Adaptation of GPS-P to yeast was achieved by us by utilizing a bicistronic gene expression vector, where yeast-enhanced Cherry and similarly, yeast-enhanced GFP are expressed from a single transcript but translated separately [15]. It was possible due to the presence of the 5 leader of the mRNA-encoding initiation factor eIF4G upstream to yeG [15]. This factor functions as an Internal Ribosome Entry Site (IRES) [16] that enables the initiation of translation in a cap-independent manner ( Figure 1). As a source for a peptide library, we employed a set of 326 yeast proteins from various cell compartments. These proteins were chosen since they are part of a protein complex. Initially we hypothesized that these proteins would be enriched with degrons at the protein-protein interaction interface, however, no significant differences in degron abundance were found between monomeric proteins and proteins in a complex (not shown). A list of the selected proteins has been provided as Supplemental Data 1 in a manuscript published by Mashahreh et al. in 2022 [14]. Sequences of the selected proteins were organized as tiled peptides of 17 amino acids (Figure 1). This length was chosen since it is a minimal length that is still anticipated to gain a 2D structure. DNA fragments corresponding to the said peptides were synthesized as an Oligo mixture and inserted into yGPS-P vector at the C-terminus of the stop-codon-less GFP by the Gibson assembly reaction. Library validation and degron presence were previously described [14]. monomeric proteins and proteins in a complex (not shown). A list of the sel has been provided as Supplemental Data 1 in a manuscript published by M in 2022 [14]. Sequences of the selected proteins were organized as tiled amino acids (Figure 1). This length was chosen since it is a minimal leng anticipated to gain a 2D structure. DNA fragments corresponding to the were synthesized as an Oligo mixture and inserted into yGPS-P vector at t of the stop-codon-less GFP by the Gibson assembly reaction. Library v degron presence were previously described [14]. (2) Data collection Plasmids were transformed into the baker's yeast Saccharomyces Cerevis to a library of 21,200 different peptides, which is ~90% of the DNA synthes ratio between GFP and Cherry was determined by flow cytometry and/or Activated Cell Sorting (FACS) (Figure 2a,b). As both Cherry and GFP are e a single transcript, yet translated independently, Cherry levels reflect the ba of the reporters while the GFP-to-Cherry ratio determines the relative GF governed by the fused peptide. Arbitrarily, four distinct populations (Bins fined, exhibiting low to high GFP-to-Cherry ratios, respectively ( Figure 2b). ulation was isolated by FACS, and purified plasmids were then amplified, ure 2c), and subjected to next-generation sequencing (NGS) (Figure 2d) DNA content of the corresponding peptides (sequence and abundance) wi (2) Data collection Plasmids were transformed into the baker's yeast Saccharomyces Cerevisiae, giving rise to a library of 21,200 different peptides, which is~90% of the DNA synthesized. Next, the ratio between GFP and Cherry was determined by flow cytometry and/or Fluorescence-Activated Cell Sorting (FACS) (Figure 2a,b). As both Cherry and GFP are expressed from a single transcript, yet translated independently, Cherry levels reflect the basal expression of the reporters while the GFP-to-Cherry ratio determines the relative GFP level that is governed by the fused peptide. Arbitrarily, four distinct populations (Bins 1-4) were defined, exhibiting low to high GFP-to-Cherry ratios, respectively ( Figure 2b). Each cell population was isolated by FACS, and purified plasmids were then amplified, barcoded (Figure 2c), and subjected to next-generation sequencing (NGS) (Figure 2d) to identify the DNA content of the corresponding peptides (sequence and abundance) within each bin.  [14]. Sequences of the selected proteins were organized as tiled peptides of 17 amino acids ( Figure 1). This length was chosen since it is a minimal length that is still anticipated to gain a 2D structure. DNA fragments corresponding to the said peptides were synthesized as an Oligo mixture and inserted into yGPS-P vector at the C-terminus of the stop-codon-less GFP by the Gibson assembly reaction. Library validation and degron presence were previously described [14].

(2) Data collection
Plasmids were transformed into the baker's yeast Saccharomyces Cerevisiae, giving rise to a library of 21,200 different peptides, which is ~90% of the DNA synthesized. Next, the ratio between GFP and Cherry was determined by flow cytometry and/or Fluorescence-Activated Cell Sorting (FACS) (Figure 2a,b). As both Cherry and GFP are expressed from a single transcript, yet translated independently, Cherry levels reflect the basal expression of the reporters while the GFP-to-Cherry ratio determines the relative GFP level that is governed by the fused peptide. Arbitrarily, four distinct populations (Bins 1-4) were defined, exhibiting low to high GFP-to-Cherry ratios, respectively ( Figure 2b). Each cell population was isolated by FACS, and purified plasmids were then amplified, barcoded (Figure 2c), and subjected to next-generation sequencing (NGS) (Figure 2d) to identify the DNA content of the corresponding peptides (sequence and abundance) within each bin.   (3) Data analysis NGS data enables the quantification of peptide levels in each FACS bin. With adequate coverage, scoring values were assigned to each peptide, representing their average bin location, termed Protein Stability Index (PSI) [17]. Thus, the PSI value of each peptide is a relative measure of propensity to contribute to the degradation of the otherwise stable GFP. PSI values facilitated the binary classification of each peptide as a degron or non-degron (PSI ≤ 2.2, PSI ≥ 2.8, respectively) ( Figure 3a). The resulting classified data was used to train a machine learning model, which provided a predictor for each amino acid within a given protein to be part of a degron (QCDPred [13]). For example, both experimental and QCDPred data for the deubiquitinase (DUB) protein Ubp6, denoted the presence of two potential degrons (Figure 3b, p > 0.85). Subsequent assignment of these regions to the Ubp6 3D structure (PDB 7QO5) [18], indicated that one is buried within the protein core while the other forms a surface-exposed hydrophobic stretch that likely interacts with other, yet-to-be-identified, protein(s) (Figure 3c).
Biomolecules 2023, 13, x FOR PEER REVIEW 4 of (3) Data analysis NGS data enables the quantification of peptide levels in each FACS bin. With ad quate coverage, scoring values were assigned to each peptide, representing their avera bin location, termed Protein Stability Index (PSI) [17]. Thus, the PSI value of each pept is a relative measure of propensity to contribute to the degradation of the otherwise sta GFP. PSI values facilitated the binary classification of each peptide as a degron or no degron (PSI ≤ 2.2, PSI ≥ 2.8, respectively) ( Figure 3a). The resulting classified data w used to train a machine learning model, which provided a predictor for each amino a within a given protein to be part of a degron (QCDPred [13]). For example, both expe mental and QCDPred data for the deubiquitinase (DUB) protein Ubp6, denoted the pr ence of two potential degrons (Figure 3b, p > 0.85). Subsequent assignment of these regio to the Ubp6 3D structure (PDB 7QO5) [18], indicated that one is buried within the prot core while the other forms a surface-exposed hydrophobic stretch that likely interacts w other, yet-to-be-identified, protein(s) (Figure 3c).

Materials and Equipment
Here we describe all materials and key equipment needed to perform the yGPS-P screen. This includes bacteria and yeast strains (Table 1), Primers (Table 2), Commercial kits (Table 3), chemicals (Table 4), Key equipment (Table 5), and yeast and bacteria growth media (Tables 6-8).   Growth media: Table 6. Luria Broth (LB) *.

Library Design and Ordering
As a source for the peptide library, we chose 326 yeast proteins. The coding DNA sequence of each protein was divided into 51 bp fragments with 36 bp overlaps between neighboring oligonucleotides. All fragments were synthesized by LC Sciences (Houston, TX, USA) to be flanked with two primer-binding regions complementary to the yGPS-P expression vector to enable Gibson assembly

Library Amplification
Timing: 3 h. This step describes the amplification and clean-up of the library pool.

1.
Conduct the following PCR to amplify the library: Troubleshooting: if no band is visible, increase the concentration of the library pool or add more cycles (up to 24 cycles total).

4.
Clean up the PCR product with NucleoSpin Gel and PCR Clean-up kit, according to the manufacturer's instructions, and measure the concentration by NanoDrop.
Pause Point: The DNA may be stored at −20 • C.

Vector Linearization
Timing: 1 day This step describes how to perform the restriction digestion and the clean-up of the plasmid.

5.
Set up the restriction digest reaction as follows: Troubleshooting: if the plasmid is not fully digested, increase the restriction duration, and/or add more restriction enzymes.

8.
Excise the band of the linearized plasmid from the gel using a sterile scalpel.
Note: Perform the cutting as quickly as possible to avoid DNA damage by UV.

9.
Extract the plasmids from the gel using the NucleoSpin Gel and PCR Clean-up kit, according to the manufacturer's protocol. Quantify the product by A260 using a nanodrop spectrophotometer. Expected DNA recovery is approximately 60%.
Pause Point: The DNA may be stored at −20 • C.

Gibson Assembly Reaction and Efficiency Transformation to E. coli
Timing: 3 h.
This step describes the cloning of the library pool into the plasmid by Gibson assembly, followed by clean-up and the transformation into electrocompetent bacteria cells. 18. Electroporate the cells with the ECM 399 electroporator system (BTX) using a 2 kV pulse. 19. Immediately, add the recovery medium to the cuvette and invert three times. 20. Transfer the mixture to Eppendorf tubes and incubate at a 37 • C rotating incubator for a 1 h recovery. 21. Spin down the cells and discard 700 µL from the supernatant. Spread the remaining bacteria on 120 × 120 mm Petri dishes containing LB medium with 50 µg/mL ampicillin for plasmid selection. 22. Incubate the plates upside-down for 16 h at 37 • C.
Note: plate the bacteria instead of growing them in a liquid medium to prevent bacteria competition for resources which may lead to the depletion of some sequences. 37. Plate the transformation mixes on 120 × 120 mm SD-Leu plates. Incubate the plates for 2-4 days to recover transformants. 38. Scrape all colonies from the plates and combine them in a 50 mL Falcon tube. Centrifuge at 3000× g and remove supernatant. Resuspend with 5 mL SD-leu, add 25% glycerol, and keep at −80.

FACS Isolation of Cells Having Different GFP/Cherry Ratios on BD FACSAria
Timing: 1 day This part describes the method for sorting yeast cells into four populations based on their GFP to Cherry ratio. On the day of sorting, optimum sorting accuracy was attained by performing an automatic drop delay setup with BD Accudrop beads (BD Biosciences, Catalogue No. 345249) according to the manufacturer's recommendation. Cytometer Setup &Tracking was performed before analysis using the CS&T beads (BD Biosciences, Catalogue No. 655050). A nozzle of 70 µm should be used while sorting.
39. Grow cells for 5-6 h and dilute to OD600 of~0.002-0.005 in 10 mL SD-Leu. 40. Grow cells at 30 • C for about 14-18 h to reach 0.5-0.7 OD. 41. Keep cells at Room temperature for 30 min before sorting. 42. Add 0.01% sodium azide. 43. Run cherry-expressing and GFP-expressing cells to exclude cherry-negative and GFPnegative cells, respectively. 44. Gate single cells based on their forward and side scatter parameters. 45. Run the empty plasmid containing cells to optimize the population's gating. 46. Divide the cells into four bins based on their GFP to mCherry ratio. 47. Collect 2 million of each population in 4 mL tubes that have 0.5 mL of SD-leu media.

Plasmid Extraction and Library Preparation for Sequencing
Timing: 2 days In this step, plasmids were purified from each population and then used to generate amplicons suitable for Illumina NextSeq sequencing. Illumina adapters and indexation were added to the DNA library through two rounds of PCR.
48. Collect sorted cells by centrifugation at 10,000× g for 5 min. Carefully take out the supernatant by slow pipetting. 52. Load 2 µL of the reaction on a 2% TAE/agarose gel and run at constant 100 V for 30 min. 53. Purify the PCR product using AMPure XP beads. Quantify the DNA using NanoDrop. 54. Perform and run the second PCR reaction with the Illumina primers to add the indexation. In the second PCR, use 50 ng of DNA templates and run the PCR program for 12 cycles. 55. Purify the PCR product using AMPure XP beads and quantify the DNA. 56. Pool the PCR products at a 1:1 ratio and load on 2% TAE/agarose gel. Excise the band of the library from the gel using a sterile scalpel and extract the DNA from the gel. 57. Quantify the concentration and send it for sequencing. We recommend a 4 nM final concentration of your sequencing library.

Peptide Alignment and Quantification
An efficient data structure can be used to organize the sequencing data.
58. Trim nucleotide prefixes, including PCR barcodes, to yield DNA fragments consisting of shorter and more relevant sequences. 59. Use the Bowtie2 Python package for alignment, allowing a single mismatch, to yield a high coverage of the library. 60. Assess the appearance of each of the assigned peptides in the different FACS bins.
Normalize the total read number of each bin to 1 × 10 6 reads/bin.

Extraction of Protein Stability Index (PSI)
61. To calculate the contribution of each peptide to the stability of the fused GFP, determine its average position between the four bins, termed protein stability index (PSI) [17], using the equation: PSI i = where g is the gate number, ·f i,g is the frequency of the peptide i at the gate g. This yields a number between 1-4 that reflects peptide stability. The PSI of each 17 amino acid tile was assigned to the central amino acid (position nine; Figure 3b, black dots).
62. To obtain PSI for the peptide-devised proteome, average PSI values of all peptides harboring each amino acid (Typically, three tiles). Window-smooth this profile by a five-residue (plus/minus two) running median (Figure 3b, black line) [13,14].

Demonstration of the Use of a Training Model
A machine learning approach, using logistic regression, was employed to calculate the probability of any region in the proteome contributing to protein degradation and further disclose the unique characteristics of degron determinants.
Empirical PSI values served as the training dataset. Fragments with low read count (<50 counts at all FACS bins combined) were filtered out. The data was classified as stable or unstable, based on their PSI values. Any value smaller than 2.2 was considered unstable and was classified as "one" on a binary scale, while values greater than 2.8 were considered stable and classified as "zero". Mid-range values (2.2 < PSI < 2.8) were omitted from the training set to reduce noise in the output model. To prevent overfitting behavior, ridge regularization with a value of λ = 0.001 was used for training.
The resulting product of the training model is a function that returns a probability score for a peptide to be a degron. As this function receives a peptide with a length of 17 amino acids and returns a value between zero and one, each amino acid residue within a protein (excluding the eight amino acids at the ends) can receive a prediction value. Together, these values can be fitted into a curve (using the cfit function in Matlab) to yield a continuous trending line of stability along the protein (see Figure 3b).

Demonstration of Data Analysis
We applied the QCDPred on the yeast proteome, which was retrieved from UniProt [20], comprising a total of 6617 protein sequences. Then, we used the peptide sequences and their degron probability to extract different features such as: (1) Hydrophobicity The correlation between degron probability and hydrophobicity of peptides was assessed using the Kyte-Doolittle scoring chart [21] (Figure 4a), however, any other hydrophobicity scale can be used. To do so, Kyte-Doolittle scores of each amino acid within each peptide were summed and averaged over the peptide length to obtain the peptide hydrophobicity score.

Demonstration of Data Analysis
We applied the QCDPred on the yeast proteome, which was retrieved from U [20], comprising a total of 6617 protein sequences. Then, we used the peptide seq and their degron probability to extract different features such as: (1) Hydrophobicity The correlation between degron probability and hydrophobicity of peptides sessed using the Kyte-Doolittle scoring chart [21] (Figure 4a), however, any other phobicity scale can be used. To do so, Kyte-Doolittle scores of each amino acid each peptide were summed and averaged over the peptide length to obtain the p hydrophobicity score. (2) Secondary Structure Analyses To determine the secondary structure of peptides/degrons (helix, beta shee (2) Secondary Structure Analyses To determine the secondary structure of peptides/degrons (helix, beta sheet, turn, bend, others), structural data needs to be collected. In our work, the AlphaFold 2.0 database was used to retrieve structural data for all proteins (out of complex) to yield more accurate results.
Using the AlphaFold-generated CIF structure file, the secondary structure can be determined for every amino acid in a protein and intersected with other properties such as degron prediction to obtain the secondary structure prevalence within the degron population.
(3) Accessible Surface Area (ASA) Accessible Surface Area can be determined by utilizing the AlphaFold dataset and the FreeSASA 2.0 Python tool [22]. The algorithm assigns an ASA value to each amino acid of the yeast proteome and determines its accessibility region to the cytoplasm based on 3D atom distances, inferred from the related AlphaFold PDB files.

Results
The yGPS-P method was previously used for identifying QCAP degrons in the yeast Saccharomyces Cerevisiae. Using a synthetic library, fused to the C-terminus of GFP we isolated four plasmid populations with distinct GFP-to-Cherry ratios. Scoring each peptide, we next determined their PSI score. Using an advanced machine-learning algorithm, each amino acid in the yeast proteome was scored for its probability to be part of a degron. The new information was used for determining the amino acid and secondary structure preferences of the yeast degronome. We found that the QCDPred algorithm points to a preference for hydrophobic amino acids (Figure 4a). Cross-referencing the predicted yeast degronome with the AlphaFold yeast structure database [23,24], we detected enrichment of alpha-helices in QCAP degrons (Figure 4b). To evaluate the exposure and accessible surface area of amino acids within any protein, the AlphaFold structural PDB was used. We found that, in general, degrons are less solvent accessible, compared to the entire proteome ( Figure 4c).
An important feature of the yeast QCAP ubiquitination machinery is that, unlike humans, it does not significantly encompass E3 ligases of the cullin RING family that recognize C-terminal amino acids as degrons. Consequently, C-terminal degrons were not enriched in the degron population of the yGPS-P screen [14]. Instead, QCAP degrons were found to be mostly internal and their QCDPred projected localization fit well with the published literature. The substantial enrichment of degrons in alpha helix secondary structure (Figure 4b), implies that a significant portion of the fused peptides can gain a secondary structure in the context of the GFP fusion. Thus, our data indicate that a C-terminally fused library of peptides that emerged from full-length proteins can be used to determine internal QCAP degrons.
Our observation that QCAP degrons are less susceptible to the solvent is not surprising; it is now well established that proteins tend to fold with their hydrophobic regions frequently buried at the protein core [25]. This raises the question of how QCAP degrons are being exposed during targeting of the UPS. Unlike an isolated environment, in which misfolded proteins can gain multiple extensively misfolded protein states [26], in vivo, proteins are likely to be in constant equilibrium between partially/mildly misfolded states, owing to barriers in their energy landscape. It is plausible that mild changes to the protein structure, mostly affecting its tertiary organization, are sufficient to expose well-folded hydrophobic helices that trigger the recruitment of E3 ligases. However, the underlying mechanisms and the contribution of protein amino acid sequences and structures to in vivo protein misfolding are yet to be determined.
Notably, yGPS-P can be utilized for other applications as well. For example, it can be used to identify stabilizing and/or destabilizing amino acids within a protein of interest, as previously conducted for the yeast Mat alpha2 protein [27]. Low-fidelity PCR or saturated mutagenesis can be used for preparing a large number of variants that can be sorted by FACS. Changes in PSI compared to the wild-type protein will indicate effects on stability. This is a very sensitive and robust way of determining the importance of specific amino acids to overall protein stability.