Inverse design of viral infectivity-enhancing peptide fibrils from continuous protein-vector embeddings

Amyloid-like nanofibers from self-assembling peptides can promote viral gene transfer for therapeutic applications. Traditionally, new sequences are discovered either by screening large libraries or by creating derivatives of known active peptides. However, the discovery of de novo peptides, which are unrelated in sequence to any known active peptides, is limited by the difficulty of rationally predicting structure-activity relationships, because their activities typically have multi-scale and multi-parameter dependencies. Here, we used a small library of 163 peptides as a training set to predict de novo sequences for viral infectivity enhancement using a machine learning (ML) approach based on natural language processing. Specifically, we trained an ML model using continuous vector representations of the peptides, which were previously shown to retain relevant information embedded in the sequences. We used the trained ML model to sample the sequence space of peptides with 6 amino acids to identify promising candidates. These 6-mers were then further screened for charge and aggregation propensity. The resulting 16 new 6-mers were tested and found to be active with a 25% hit rate. Strikingly, these de novo sequences are the shortest active peptides for infectivity enhancement reported so far and show no sequence relation to the training set. Moreover, by screening the sequence space, we discovered the first hydrophobic peptide fibrils with a moderately negative surface charge that can enhance infectivity. Hence, this ML strategy is a time- and cost-efficient way to expand the sequence space of short functional self-assembling peptides, exemplified here for therapeutic viral gene delivery.


Regression Model: Infectivity Prediction
LASSO and RIDGE regression models were trained on the training-set peptides, represented as 100-dimensional numerical vectors using continuous vector representations; the two models perform similarly (Figure S1). Because LASSO regression regularizes by penalizing the absolute size of the coefficients, which drives many of them to exactly zero, the resulting model retains only the relevant parameters and is therefore simpler. For example, in our model only 21 vector components have a non-zero coefficient (Eqn. 1). Interestingly, although the RIDGE regression equation contains all vector components, it gives a slightly poorer correlation (Figure S1). All code and data used in ML training are openly available at https://gitlab.com/arghyadutta/seqto-infect.
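The difference between the two regularizers can be illustrated with a minimal sketch. It is not the code from the repository above: the function names, the toy data, the penalty strength and the choice of a coordinate-descent solver without intercept are all assumptions made for illustration. It only shows why an L1 penalty sets weak coefficients to exactly zero while an L2 penalty merely shrinks them.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within +/- threshold becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_linear(X, y, alpha, penalty, n_iter=100):
    """Coordinate descent for a linear model without intercept (illustrative only).

    penalty="lasso": minimizes (1/2n)*||y - Xw||^2 + alpha*||w||_1
    penalty="ridge": minimizes (1/2n)*||y - Xw||^2 + (alpha/2)*||w||^2
    Assumes that no feature column is identically zero.
    """
    if penalty not in ("lasso", "ridge"):
        raise ValueError("penalty must be 'lasso' or 'ridge'")
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                partial = sum(w[k] * X[i][k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            if penalty == "lasso":
                w[j] = soft_threshold(rho, alpha) / scale
            else:
                w[j] = rho / (scale + alpha)
    return w


# Toy data (hypothetical): three mutually orthogonal "embedding components".
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
# The target depends strongly on component 0 and only weakly on component 2.
y = [2.0 * row[0] + 0.05 * row[2] for row in X]

lasso_w = fit_linear(X, y, alpha=0.1, penalty="lasso")
ridge_w = fit_linear(X, y, alpha=0.1, penalty="ridge")
print("LASSO coefficients:", lasso_w)
print("RIDGE coefficients:", ridge_w)
```

In this toy case the weak component is removed entirely by the LASSO fit but kept with a small coefficient by the RIDGE fit, which mirrors the sparse 21-component model versus the full 100-component model described above.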

Figure S1
A LASSO and B RIDGE linear regression models were trained via 5-fold cross-validation.
Aggrescan is based on statistical analysis of the aggregation-propensity value for each amino acid residue in the sequence and a subsequent aggregation prediction by hot spot regions, identified from the peptide aggregation profile. Here, we consider a sequence as amyloidogenic if there is at least one predicted hotspot.
Waltz applies a statistical analysis of the sequence and was originally developed using a position-specific matrix derived from 1089 short 6-mer peptide sequences that were experimentally tested for fibril formation. 7 Here, we considered a sequence amyloidogenic if at least one amyloidogenic region was detected with the following parameters: threshold custom 0-100 and pH 7.0.
Tango is designed to predict aggregating regions in unfolded polypeptide chains using a statistical mechanics algorithm. The method is benchmarked against 179 experimentally observed peptides. 5 Here, we applied the following input parameters to determine the β-sheet aggregation tendency (aggregation parameter): pH 7.4, 298 K, ionic strength 0.1724. To determine amyloidogenic sequences, we selected a threshold above 5.0% over 5 residues to identify aggregation hotspots, as suggested by the authors. 5
PATH is a structure-based method for predicting amyloidogenicity by threading and machine learning. Here, we considered a peptide as aggregating if at least one amyloidogenic region was calculated.
PASTA 2.0 is based on energy functions, determined experimentally from protein structures, that describe the interaction potential and H-bond formation between all non-consecutive residues for parallel and anti-parallel β-pairing. A sequence is considered amyloidogenic if the PASTA energy of the lowest predicted pairing is lower than or equal to the threshold stated by the authors (-4.0). 10 The parameters for the prediction were: threshold custom, top pairing energy 20, energy threshold -2 PEU, large scale: true, protein-protein analysis: false.
APPNN applies a neural network machine learning approach based on the analysis of seven physicochemical and biochemical features such as β-sheet frequency, hydrophobic moment, helix termination parameters or isoelectric point. A sequence was considered amyloidogenic if at least one of its six-amino-acid windows was classified as amyloidogenic.
Except for Waltz, these prediction tools were developed for polypeptide and protein aggregation and not for short self-assembling peptides. To find the best-performing tool for our self-assembling peptide library, we used the experimental self-assembly data from electron microscopy 4 (Table S1) as a dataset to evaluate each tool for self-assembly by means of its accuracy and its receiver operating characteristic (ROC) value. The accuracy was calculated from the confusion matrix according to Eqn. 2.
$$\text{Accuracy} = \frac{\sum \text{true positives} + \sum \text{true negatives}}{\sum \text{total population}} \qquad \text{(Eqn. 2)}$$

The ROC value for the prediction (Figure S2) was calculated with 10-fold stratified cross-validation, with the experimental fibril formation as target value, using a logistic regression learner with LASSO regularization (strength 17) in the data-mining software Orange3. 11 The prediction tools Aggrescan, APPNN and PATH performed best, with accuracies of 76%, 69% and 69%, respectively. Even though these aggregation prediction tools were trained on polypeptides and proteins, the accuracies reported for them match well with those obtained for our self-assembling peptide library composed of short peptides. Notably, combining Aggrescan, APPNN and PATH increases the performance of the aggregation prediction further (Figure S2). Therefore, we applied Aggrescan, APPNN and PATH to predict the aggregation propensity of the 3669 de novo predicted sequences. A sequence was considered aggregating if at least two of Aggrescan, APPNN and PATH were positive. With this method, 424 of the 3669 peptides were predicted to aggregate by at least two of these tools. As shown in Figure S3, the aggregation tools performed with comparable accuracy for the selected 16 peptides as determined by Aggrescan: 75% for the training set (Figure S3C) and 63% for the de novo predicted peptides (Figure S3D).

Figure S2
A Evaluation of the protein-aggregation tools Tango, 5 APPNN, 6 Waltz, 7 PATH, 8 Aggrescan 9 and PASTA 2.0 10 with the training set (EF-C based library, Table S1). B ROC value for the prediction, calculated with 10-fold stratified cross-validation, with the experimental fibril formation as target value, using a logistic regression learner with LASSO regularization (strength 17).
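The two-of-three consensus rule used to label a sequence as aggregating can be written as a short sketch. The function name, the boolean encoding of each tool's result and the placeholder peptide names are assumptions for illustration; the per-tool calls below are invented and are not actual predictions.

```python
def consensus_aggregating(aggrescan, appnn, path, min_votes=2):
    """Return True if at least min_votes of the three tools predict aggregation."""
    votes = sum(bool(v) for v in (aggrescan, appnn, path))
    return votes >= min_votes


# Hypothetical per-tool results as (Aggrescan, APPNN, PATH); not real predictions.
calls = {
    "peptide_1": (True, True, False),
    "peptide_2": (False, True, False),
    "peptide_3": (True, True, True),
    "peptide_4": (False, False, False),
}

selected = [name for name, result in calls.items() if consensus_aggregating(*result)]
print(selected)
```

Requiring two positive votes rather than one keeps a single tool's false positive from passing a sequence on its own.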

Figure S3
Aggregation prediction tools applied to the 16 de novo created peptides. A Summary of aggregation prediction results, predicted infectivity according to the ProtVec LASSO model (Eqn. 1), calculated hydrophobicity and comments on selection criteria. B Comparison of experimental and predicted aggregation. Experimental aggregation was determined by fibril formation observed in TEM. 8 peptides were predicted to aggregate and 8 peptides were not predicted to aggregate by at least two of the tools Aggrescan, APPNN and PATH. C The accuracy of the aggregation prediction when applying at least two prediction tools was determined to be 75%. D The aggregation prediction accuracy for Aggrescan alone, calculated from the confusion matrix for the predicted peptides, is 63%.
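The confusion-matrix accuracy of Eqn. 2, which underlies the percentages in panels C and D, can be sketched as follows. The function names and the five-element toy lists are assumptions chosen only to show the arithmetic; they are not the measured data.

```python
def confusion_counts(predicted, observed):
    """Count true/false positives and negatives from paired boolean labels."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    tp = tn = fp = fn = 0
    for p, o in zip(predicted, observed):
        if p and o:
            tp += 1
        elif not p and not o:
            tn += 1
        elif p and not o:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (true positives + true negatives) / total population."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("empty confusion matrix")
    return (tp + tn) / total


# Toy labels (hypothetical): True = aggregation predicted / fibrils observed.
predicted = [True, True, True, False, False]
observed = [True, True, False, False, True]

counts = confusion_counts(predicted, observed)
print(counts, accuracy(*counts))
```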

N-gram similarity for predicted peptides with training set
The N-gram sequence similarity between the net-charge-positive peptides (3669 in total) predicted for infectivity enhancement and the training set was calculated to ensure a diverse selection of peptides, both semantically close to and far away from the training set. The N-gram similarity factor quantifies the similarity of two strings and returns 0 for identical sequences and 1 for sequences which have nothing in common. We applied the algorithm by Kondrak 12 for 2-grams with the Python script available at https://github.com/luozhouyang/python-string-similarity.git. Table S5 lists a matrix of the N-gram similarity values of all 3669 peptides with the training set. As shown in Figure S5, the N-gram similarity values of the selected 16 peptides cover a wide range from 0.33 to 0.93, which quantifies the diversity of the selected sequences.

Figure S4
Absolute abundance of amino acids in net-charge-positive peptides (3669 in total) predicted for infectivity enhancement. Cysteine (C) and tryptophan (W) are the most prevalent amino acids.

Figure S5
Overview of N-gram similarity values between the selected sequences and the training set. The average N-gram similarity is the mean N-gram value between one selected sequence and every sequence of the training set. The highest similarity value is the lowest value (highest similarity) found for each selected sequence, shown with the corresponding sequence from the training set. Values are colored on a gradient from red (0) to blue (1).
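To make the 0-to-1 scale of the N-gram similarity factor concrete, the following is a simplified, self-contained sketch in the spirit of Kondrak's positional N-gram distance for 2-grams. It is an assumption-laden re-implementation for illustration and may differ in detail from the python-string-similarity library used in this work; the function name and the padding character are hypothetical choices.

```python
def ngram_distance(a, b, n=2, pad="\n"):
    """Normalized positional n-gram edit distance (0 = identical, 1 = nothing in common)."""
    if a == b:
        return 0.0
    if not a or not b:
        return 1.0
    # Prefix padding so that the first residue also closes a full n-gram.
    pa = pad * (n - 1) + a
    pb = pad * (n - 1) + b

    def substitution_cost(i, j):
        # Compare the n-grams ending at a[i-1] and b[j-1], position by position.
        mismatches, compared = 0, n
        for x, y in zip(pa[i - 1:i - 1 + n], pb[j - 1:j - 1 + n]):
            if x != y:
                mismatches += 1
            elif x == pad:
                compared -= 1  # matching padding carries no information
        return mismatches / compared

    previous = [float(i) for i in range(len(a) + 1)]
    for j in range(1, len(b) + 1):
        current = [float(j)] + [0.0] * len(a)
        for i in range(1, len(a) + 1):
            current[i] = min(
                current[i - 1] + 1.0,  # skip an n-gram of a
                previous[i] + 1.0,     # skip an n-gram of b
                previous[i - 1] + substitution_cost(i, j),
            )
        previous = current
    return previous[len(a)] / max(len(a), len(b))


print(ngram_distance("ICICLK", "ICICLK"))  # identical sequences
print(ngram_distance("ICICLK", "ICICLR"))  # one residue changed
print(ngram_distance("ICICLK", "HVWCIF"))  # two different 6-mers
```

Averaging this value over all training-set sequences, or taking its minimum, would give quantities analogous to the average and highest similarity reported in Figure S5.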

Evaluation of De Novo Peptide Activity with Property-Activity Model 4
Three of the newly predicted peptides show unexpected activity despite negative Zeta-potential.
To test whether these peptides follow a different mode of action a property activity model

Impact of Disulfide Bond Formation on Self-Assembly
Thiol groups from the side chain of cysteine can undergo disulfide bond formation with other thiol groups, which is known to influence self-assembly properties. 13 To study the impact of disulfide bond formation in cysteine-rich short peptides, we applied tris(2-carboxyethyl)phosphine (TCEP) 14 in an excess of 10 molar equivalents to break disulfide bonds, using the peptide ICICLK as an example. Transmission electron microscopy was performed to evaluate nanoscopic morphology, and brightfield microscopy was performed to evaluate microscopically large aggregation (Leica DMi8, 10x air objective). Surface charge and microscopic aggregation were evaluated via zeta-potential measurements.
Breaking disulfide bonds drastically changes the assembly properties of ICICLK (Figure S10). Without disulfide bonds, no fibril formation (Figure S10 A) and no microscopic aggregation (Figure S10 B, D) can be observed, which also results in reduced surface charge (Figure S10 C).
Interestingly, for the original peptide EF-C in the training set, the addition of TCEP has no visible influence on fibril formation and aggregation (Figure S10 F). This is likely due to the stabilizing effect of the alternating amphiphilic sequence pattern found in EF-C, which we identified earlier as driving assembly even in the absence of cysteine. 2,4 Thus, we conclude that disulfide bond formation is a critical feature for self-assembly of the newly identified 6-mer peptides.

Amino acid Composition Analysis
To explore potentially common amino acid compositions between highly active peptides in the training set and the newly discovered active 6-mer peptides, we conducted a simplified, coarse-grained analysis. This analysis calculates the percentage of charged, hydrophobic, and hydrogen-bonding amino acids in peptides using the Hopp-Woods amino acid classification (Figure S11 A, Code S1). 15 Additionally, we coarse-grained the peptide activity into three categories: "high" (infectivity relative to EF-C > 0.7), "medium" (infectivity relative to EF-C > 0.1), and "low" (infectivity relative to EF-C < 0.1) active sequences.
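The coarse-graining described above can be sketched as follows. This is not Code S1: the assignment of residues to the three classes below is an assumed grouping for illustration and may differ from the one actually used, the function names are hypothetical, and treating an activity of exactly 0.1 as "low" is likewise an assumption, since the text leaves that boundary open.

```python
from collections import Counter

# ASSUMED three-class grouping, for illustration only (may differ from Code S1).
ASSUMED_GROUPS = (
    ("DEKR", "charged"),
    ("NQSTHYGP", "hydrogen_bonding"),
    ("ACFILMVW", "hydrophobic"),
)
RESIDUE_CLASS = {res: label for residues, label in ASSUMED_GROUPS for res in residues}


def composition(sequence):
    """Fraction of residues in each class, normalized by peptide length."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    # Raises KeyError for characters outside the 20 standard amino acids.
    counts = Counter(RESIDUE_CLASS[res] for res in sequence.upper())
    return {label: counts[label] / len(sequence) for _, label in ASSUMED_GROUPS}


def activity_category(relative_infectivity):
    """Coarse-grain infectivity relative to EF-C into high / medium / low."""
    if relative_infectivity > 0.7:
        return "high"
    if relative_infectivity > 0.1:
        return "medium"
    return "low"  # assumption: values of exactly 0.1 are treated as low


print(composition("ICICLK"))
print(activity_category(0.8), activity_category(0.5), activity_category(0.05))
```

Normalizing by length makes the 6-mers comparable with the longer peptides of the training set.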
The analysis of the training set revealed that peptides categorized as "high" and "medium" active contained a higher proportion of hydrophobic amino acids, while "low" activity peptides displayed a greater prevalence of charged amino acids (Figure S11 B). Remarkably, this same trend was observed in the de novo predicted peptides. The four active peptides, HVWCIF, HFICIC, ICICLK, and HICLFW, displayed a significantly higher content of hydrophobic amino acids compared to charged or hydrogen-bonding classified amino acids, which were predominantly found in non-active sequences (Figure S11 C).
It is worth noting that traditional prediction methods often rely on a predetermined set of descriptors. In contrast, the vector embedding approach employed in our study allows underlying descriptors to be identified without any such assumptions. A data-driven approach utilizing vector embeddings therefore provides the flexibility to uncover latent descriptors that may not have been considered previously.

Figure S11
Comparison of the amino acid composition of the training set and the de novo predicted peptides.
A To quantify the amino acid distribution, the amino acids and the activity were coarse-grained. The amino acid composition of a peptide was calculated by counting the number of amino acids of each Hopp-Woods type (charged, hydrogen-bonding, or hydrophobic) in it and normalizing each count by the peptide's length (Code S1). B The peptides of the training set were categorized into high, medium, and low active sequences. Highly active peptides have on average a higher percentage of hydrophobic amino acids. C The active de novo predicted sequences (bold) have a higher percentage of hydrophobic amino acids compared to inactive sequences.

Figure S6) and β-sheet content is determined by ATR-FT-IR spectroscopy (Figure S9)

Table S3 contains information on the top 12320 sequences from the Monte Carlo ProtVec LASSO model screening, including predicted infectivity, hydrophobicity, and net charge, and is openly available at the following data repository DOI: 10.5281/zenodo.7708290

Table S4 contains information on the top 3669 peptides with a net positive charge, including aggregation prediction results from Aggrescan, APPNN, and PATH, and is openly available at the following data repository DOI: 10.5281/zenodo.7708290

Table S5 contains the N-gram similarity matrix composed of the top 3669 peptides and the 163 peptides from the training set and is openly available at the following data repository DOI: 10.5281/zenodo.7708290

Predicted Peptides Characterization
Code S1 is a Python script for calculating the composition of charged, hydrogen-bonding, and hydrophobic amino acids in a peptide sequence library according to the Hopp-Woods classification. 15 The code and the corresponding training set of coarse-grained peptides are openly available at the following data repository DOI: 10.5281/zenodo.8004720