PARROT: a flexible recurrent neural network framework for analysis of large protein datasets

The rise of high-throughput experiments has transformed how scientists approach biological questions. The ubiquity of large-scale assays that can test thousands of samples in a day has necessitated the development of new computational approaches to interpret this data. Among these tools, machine learning approaches are increasingly being utilized due to their ability to infer complex non-linear patterns from high-dimensional data. Despite their effectiveness, machine learning (and in particular deep learning) approaches are not always accessible or easy to implement for those with limited computational expertise. Here we present PARROT, a general framework for training and applying deep learning-based predictors on large protein datasets. Using an internal recurrent neural network architecture, PARROT is capable of tackling both classification and regression tasks while only requiring raw protein sequences as input. We showcase the potential uses of PARROT on three diverse machine learning tasks: predicting phosphorylation sites, predicting transcriptional activation function of peptides generated by high-throughput reporter assays, and predicting the fibrillization propensity of amyloid beta with data generated by deep mutational scanning. Through these examples, we demonstrate that PARROT is easy to use, performs comparably to state-of-the-art computational tools, and is applicable for a wide array of biological problems.


INTRODUCTION
The past decade has seen an exponential increase in the rate at which biological data is generated [1]. Technological advances coupled with the falling costs of DNA synthesis and sequencing have made conducting high-throughput experiments accessible to most research labs [2]. The affordability of being able to sequence massive quantities of DNA is transforming how molecular biologists approach research. Protein functional assays and screens are seeing increasing library sizes, which allows researchers to investigate many different sequences and variants in a single experiment. In recently published studies, it is not uncommon to find deep mutational scanning (DMS) experiments that achieve nearly complete sequence coverage, or assays that test tens of thousands of peptides [3][4][5][6][7][8][9][10]. This abundance of data being generated has the potential to answer important biological questions, however, at the same time it also significantly complicates experimental analysis.
Coinciding with the explosion of high-throughput omics experiments has been the development of computational methods for analyzing the resulting high-dimensional biological data. In particular, machine learning approaches have emerged as popular strategies in a wide range of biological applications [11][12][13]. In general, machine learning approaches are effective at identifying patterns in complex datasets and extrapolating these learned patterns to make predictions on previously untested samples. Deep learning approaches, as opposed to "shallow" machine learning approaches, such as logistic regression, are particularly well-suited for biological applications as they can implicitly capture relevant features in order to model complex, non-linear, biological relationships [14][15][16]. In the context of protein datasets, deep learning approaches offer the attractive quality of allowing researchers to simply input raw protein sequences into the model, rather than requiring an intermediate step where proteins are reduced into simplified representations (e.g. amino acid content or biophysical properties) [15].
However, despite their advantages over simpler models, deep learning approaches are still a relatively specialized form of data analysis. As a result, in many domains of biological sciences, there remains a technical and conceptual barrier for labs to apply deep learning approaches to their data. In some cases, this could be reasonably attributed to preference for more interpretable simple models, rather than more accurate, but often cryptic, deep learning models [17,18]. In other cases, this lack of adoption could be due to a general unfamiliarity and inexperience with deep learning. Indeed, the field of deep learning can appear daunting for those without extensive computational backgrounds. For an untrained scientist with amenable high-throughput datasets, it may be infeasible or too time-consuming to implement deep learning models into an analysis workflow.
Here, we aim to make cutting-edge deep learning accessible to a broad audience of biological researchers through our package PARROT (Protein Analysis using RecuRrent neural networks On Training data). PARROT is designed to be a general framework for training machine learning networks on large protein datasets, then using the trained network to make predictions on new protein sequences. The user side of PARROT is an easy-to-use command line tool that is flexible enough to handle a variety of data formats and machine learning tasks. In its implementation, PARROT carries out the computational heavy lifting through implementation of a recurrent neural network (RNN). RNNs are a class of deep learning architecture originally designed for language processing applications, but have since been employed with remarkable success in biology [19][20][21][22][23][24][25][26]. Compared to other deep learning approaches, RNNs are unique in that they are designed to handle variable length sequences which makes them well-suited for applications involving proteins. Using only raw protein sequences as input, RNNs can learn the relevant positional dependencies of amino acids needed to associate each sequence with a corresponding functional value or values. Through this architecture, PARROT is able to capture intrinsic patterns in large protein datasets in order to construct highly accurate predictive models.
In this paper, we introduce the underlying RNN architecture of PARROT and demonstrate its application to three different biological problems. First, we show that PARROT performs at a near state-of-the-art level on phosphorylation site prediction tasks, a well-characterized bioinformatics problem. Second, we use PARROT to train a predictor of transcriptional activation activity using the extensive peptide library from Erijman et al. [5].
Third, we demonstrate how PARROT can be used in conjunction with DMS assays, using the amyloid-beta-based dataset from Seuma et al. [8]. Ultimately, we show that PARROT is an effective, generalizable, and easy-to-use machine learning tool that is applicable to a range of different protein datasets.

PARROT is a general RNN framework
Our motivation behind PARROT was to develop a powerful deep learning tool that is easy to implement into any large-scale protein analysis workflows (>1,000s of sequences) ( Figure 1A). The general workflow involves the following steps. A user starts with a set of sequences of interest where each sequence (or each residue in each sequence) has some label associated with it, either a discrete class or a continuous value. PARROT uses this initial dataset to train, validate, and test a deep learning model. Training, validation and testing is all performed automatically within PARROT using standard best practices for machine learning model generation. Once a model is built, the user can use that model to make predictions on new sequences for which there is no data associated.
We used the PyTorch platform to implement the core RNN framework of PARROT (Paszke et al. 2019). The serialized architecture of RNNs and their ability to handle variable length inputs makes them well-suited for learning information from protein sequences. In the context of protein analysis, each cell in an RNN integrates information from a particular amino acid with the output ("hidden state vector") of the preceding cell in the network. However, there are two main drawbacks with using the standard RNN architecture for protein analysis. First, standard RNNs require that information is propagated through the network in a single direction, which imposes an arbitrary constraint on the ability of a network to learn from protein sequences.
Second, standard RNNs are susceptible to the "vanishing gradient problem", which arises due to the multiplication of many small values and can limit the ability of a network to learn long-range dependencies in the data [27]. PARROT implements several common variants of RNN architecture in order to mitigate these factors ( Figure 1B). To address the first, the RNN implementation of PARROT is bidirectional, such that there are two parallel RNNs that propagate information from the protein sequence in opposite directions (N-to-C and C-to-N) simultaneously [28]. To address the issue of vanishing gradients, PARROT employs Long Short-Term Memory (LSTM) cells, which have been shown to capture long-range dependencies in sequences more efficiently than standard RNNs [29]. Combining bi-directionality with LSTM cells has been previously demonstrated to be effective at learning from biological sequences [21][22][23][24]26]. Taken together, PARROT's underlying network architecture is specifically optimized for working with protein datasets.
PARROT was designed to conceal the inner-workings of this RNN, such that only a limited set of information is required from the user. For the most basic usage, the user only needs to provide their data and specify the kind of machine learning task for which they are training the network (classification or regression, described below). User datasets are input as basic whitespace-delimited text files with each protein sequence and its corresponding data contained on a single line. This file can be prepared in any spreadsheet program (e.g. Excel) and saved as a tab-separated variable file. More detailed instructions for file preparation are provided in the PARROT documentation. One of the consequences of PARROT's internal RNN is that the provided input sequences are not required to be the same length. Before training a PARROT network, users must specify whether their application qualifies as a classification or regression task. In classification tasks, the network is trained to assign discrete class labels to each input.
For example, if one had a set of proteins where each protein localized to a specific organelle, this would lend itself to a classification task for predicting subcellular localization. For regression, the network outputs a continuous, real-number value for each input. For example, if one had a set of peptides where each sequence had an aggregation score between 0 and 1, this would lend itself to a regression task for predicting quantitative peptide aggregation. In addition to these two categories, users must also specify whether they want the PARROT network to produce per-sequence or per-residue output. Example data formats for each of these four categories are depicted in Figure 1C. Beyond this core usage, advanced users may optionally specify network hyperparameters such as the number of layers in the network, size of the hidden state vectors, learning rate, batch size, number of training epochs, and various other optional arguments (see

Methods).
In the remaining sections, we demonstrate the effectiveness of PARROT in the context of three distinct protein applications. Our goal here is to illustrate the diverse types of biological questions PARROT is capable of interrogating and to inspire readers to apply PARROT in their own research.

PARROT predicts phosphosites on par with established methods
We first benchmarked the performance of PARROT-derived networks on a commonly studied bioinformatics task: predicting phosphorylation sites in a protein sequence. We used the Phospho.ELM (P.ELM) version 9.0 [30] and PhosPhAt (PPA) version 3.0 [31,32] for training and independent validation, similar to Dou et al. [33]. P.ELM consists of literature-derived animal phosphorylation sites and PPA consists of mass spectrometry-validated phosphosites in Arabidopsis thaliana. For both of these datasets, we extracted all 19-residue windows centered around serine, threonine and tyrosine and divided each of these into phosphorylation-positive and negative sets. After filtering out similar sequences with CD-HIT [34], we then downsampled the larger phosphorylation-negative sets in order to create balanced datasets with identical numbers of phosphorylation-positive and phosphorylation-negative windows. Separate PARROT networks were trained on the serine, threonine and tyrosine windows from the P.ELM dataset (Figure 2A).
We first tested our PARROT phosphosite predictors for each of the three residues on the P.ELM dataset using 10-fold cross-validation. This involved randomly splitting each residue-specific dataset into 10 even subsets, then training on 9/10 of the data and testing on the held out 1/10 for each of the subsets. As a benchmark, we compared the performance of our PARROT networks against three established phosphosite predictors, PhosphoSVM, MusiteDeep and PHOSFER, which each rely upon different methodologies [33,35,36]. As this was a binary classification problem, we focused our analysis on sensitivity, specificity, and Matthews Correlation Coefficient (MCC) as performance metrics. We chose MCC as a performance metric because it has been shown to be more informative for binary classification tasks than the more commonly used F1 score or accuracy [37]. Overall, the PARROT networks performed better than PhosphoSVM, and at a comparable level to PHOSFER and MusiteDeep ( Figure 2B, Supplemental Table 1). Interestingly, there was variation in the relative performance of the top three methods across the three residue types, with PARROT performing best on pSer, second best on pThr, and third best on pTyr. This performance trend corresponds with the size of the training dataset available for each residue. The P.ELM cross-validation analysis also illuminated particular biases in each of the predictors. Notably, PHOSFER and MusiteDeep tended to predict with high sensitivity and low specificity, PhosphoSVM predicted with low sensitivity and high specificity ( Figure 2C-D). However, PARROT's predictions tended to be the most balanced, with comparable sensitivity and specificity across the three different residue types.
Overfitting to training data is a common problem in the field of machine learning, so to test for this we assessed the performance of our PARROT predictors on an independent test dataset. For each of the three residue types, we trained a PARROT predictor on the full P.ELM dataset and made predictions on Ser, Thr and Tyr residues in the PPA dataset. Unsurprisingly, all of the PARROT predictors performed worse on the PPA data than they did on the P.ELM cross-validation analysis; however, the PARROT predictors still performed comparably or better than the three established phosphorylation site predictors ( Figure 2B-D, Supplemental Table 2).
PARROT's comparable performance to PHOSFER on the PPA dataset is particularly notable because PHOSFER was specifically designed for the prediction of plant phosphorylation sites [35].
Ultimately, our intention behind these analyses is not to assert that our PARROT-based predictors are inherently superior to all other existing phosphorylation site predictors. Rather it is to demonstrate that PARROT, despite being a general framework for any type of protein analysis, can perform equivalently to methods that are specifically developed for a particular task. In doing so, we establish that PARROT-trained networks perform at a high level and that PARROT can confidently be extended to other less well-characterized protein applications.

PARROT can integrate into high-throughput experiment workflows
Having established that predictors trained with PARROT can accurately learn patterns in large datasets, we next focused on showcasing PARROT's ability to integrate into a typical high-throughput experiment workflow. To accomplish this, we turned to the data generated by Erijman and colleagues [5], in which the authors developed a high-throughput fluorescence assay for testing 30-mer peptides for activation domain (AD) function in yeast. Their assay measured ~37,000 sequences with AD function and~1 million without, allowing them to train a convolutional neural network to predict AD function from sequence and secondary structure information (ADpred). This general workflow of 1) generating massive quantities of data using a high-throughput assay; and 2) developing a computational predictor based on the assay data is becoming increasingly common in molecular biology. While ultimately the approach taken by Erijman et al. was computationally rigorous and successful, here we demonstrate that PARROT could readily be implemented in such a workflow without sacrificing performance ( Figure 3A).
Using PARROT in cases like this could save researchers valuable time from having to develop their own machine learning predictors from scratch.
Using 10-fold cross-validation, we trained PARROT networks on this AD dataset (see Methods) and evaluated their performance and generalizability. First, we tested how well each of the networks predicted AD function on the held-out test set. PARROT networks predicted AD function with an accuracy of 93.1% (standard error ± 0.1%) and an area under the precision-recall curve (AUPRC) of 0.973 (± 0.001), which were not significantly different from ADpred's reported accuracy and AUPRC of 93.2% (± 0.1%) and 0.975 (± 0.001), respectively ( Figure 3B). However, the PARROT-based predictors did significantly outperform the simple logistic regression model also used in Erijman et al. [5], which had an accuracy of 89.1% (± 0.4%) and AUPRC of 0.942 (± 0.002). We also assessed the generalizability of the PARROT predictors through a similar approach as in the ADpred paper. Each cross-validation-trained network was also applied to an independent yeast AD dataset from Staller et al. [38]. We found good correlation between our predicted AD values and the independent data with an average Pearson's R = 0.586 (± 0.005), which was slightly higher than the reported performance of ADpred of R = 0.57 ( Figure 3C).
To assess how the PARROT networks performed with fewer sequences to train on, we repeated both of these analyses on reduced datasets. Sampling from the complete dataset containing 75,846 30-mer peptides (50% displaying AD function), we created new 70K, 60K, 50K, 40K, 30K, 20K, 10K and 5K peptide datasets. AUPRC began to plateau around 40K peptides, and generalizability to the Staller et al. data plateaued at around 30K, indicating that PARROT can robustly capture meaningful patterns in reduced datasets ( Figure 3D).
Although all of the peptides studied in this analysis were 30 residues in length, one of the benefits of PARROT over other deep learning approaches is that it is not limited to fixed length sequences. In theory, one could train a predictor with PARROT using the combined results from multiple independent assays that test for similar phenotypes. As a proof of concept for this idea, we combined the data from Erijman et al. with the results from a similar AD functional assay that tested 5-20 residue peptides [39], trained new PARROT predictors on a variety of dataset sizes, and repeated the analyses described above. We found that 10-fold cross-validation accuracy and AUPRC slightly decreased using the combined datasets, possibly due to greater intra-dataset variance. However, performance on the independent test dataset was not significantly different (Supplemental Figure 1). Despite the modest dip in performance for this particular case, we posit that PARROT's flexibility to incorporate multiple datasets while training could be useful in other contexts where a single, comprehensive dataset is not available.
As a final set of analyses, we compared our PARROT predictor to a recently published deep learning-based method for activation domain prediction, called PADDLE, developed by Sanborn et al. [9]. Similar to ADpred, PADDLE is a deep convolutional neural network and was trained on data derived from a quantitative, high-throughput assay. When applying our PARROT predictor trained on the Erijman et al. data to the Sanborn et al. data. , we obtained relatively poor predictive power (Supplemental Figure 2). However, since ADpred had also been shown to be ineffective at predicting the data obtained by Sanborn et al., [8], we suspected that PARROT's underperformance may reflect inherent system-specific limitations in transferability between the two datasets. To test this, we leveraged PARROT's flexibility and trained a new predictor using the same training data as PADDLE. This new predictor saw substantially improved performance and was able to predict activation domain function comparably to PADDLE.

PARROT can complement DMS experiments
For our final analysis, we demonstrate a unique usage for PARROT in tandem with deep mutational scanning (DMS) experiments. We conducted our training and testing of PARROT networks using a recent DMS dataset investigating amyloid-beta (Aß42), a 42 residue peptide that can form plaques implicated in Alzheimer's disease [8,40]. In work by Seuma et al., the authors tested >450 single and >14,000 double mutants of Aß42 in an assay that measured each variant's propensity to nucleate amyloid fibrils. Each of the variants they tested was assigned a log-ratio score (normalized to WT) with positive values indicating that that variant was nucleation-prone. While this scale of this experiment was massive, the sheer combinatorics of DMS makes it infeasible to truly capture all possible single and double mutations for a peptide of this size in a single assay. In our analysis, we show that PARROT can be employed to "fill in the gaps" of a DMS experiment by training on the experimental variants and applying the network to predict the experimental outcome for variants that were not directly assayed.
We first validated PARROT's ability to predict nucleation scores from DMS data. Unlike the previous applications, the peptides obtained from DMS experiments occupy a relatively limited region of sequence space given that each sequence differs by only a few point mutations.
It was not initially clear to us if PARROT would be able to learn the general, underlying patterns within this more focused dataset rather than overfitting on specific observations. To test this, we set out to rigorously evaluate our PARROT networks by developing and applying a method of Using residue-wise cross-validation, we trained and tested PARROT networks for all 42 positions of Aß42, taking the average predictions of double mutants since they were represented in the two separate test sets. Across all of the single and double mutants in the dataset, we see good correlation between PARROT's predictions and the true assay scores (R 2 = 0.593) ( Figure   4B). To provide context for this value, between multiple biological replicates of the DMS experiments an R 2 of 0.72 was obtained, indicating to us that PARROT is effectively capturing much of the variation between sequences that is not due to biological noise [8]. Within our entire set of predictions, the correlation was tighter among the double mutants in the dataset than the single mutants, likely due to the limited information that PARROT sees for the single mutants during training (Supplemental Figure 3).
We next sought to see if the PARROT networks could capture epistatic relationships between Aß42 residues in the set of double mutants. In assays that measure complex phenotypes such as the nucleation of amyloid fibrils, it is not clear a priori if independent mutations will work synergistically or antagonistically when combined. For this analysis, we were interested in how well PARROT could predict the impact of double mutations in the DMS dataset relative to simpler estimations, such as summing the assay score of the two single mutations. Looking at only the double mutants in our dataset for which both point mutations were represented in the set of single mutants, we found that PARROT's predictions significantly outperformed this simple summing approach (p < 0.01) (Figure 4C). We also tested PARROT against other approaches for predicting double mutants: averaging the single mutant scores, taking the minimum score, or taking the maximum score, and similarly found that PARROT's predictions had significantly tighter correlation to the true values (Supplemental Figure 4). While the effect size was relatively small, it is important to note that the PARROT networks making these epistatic predictions are training without key positional information due to the residue-wise cross-validation process. PARROT is not simply integrating information from the two single mutants, but rather it is making predictions based on general patterns it has learned from other variants.
Lastly, we wanted to see if PARROT was an effective tool for prioritizing untested candidate variants for follow-up study. Since it is infeasible for DMS experiments to test all possible point mutations in the protein sequence, we reasoned that PARROT might be an effective tool for making predictions on the mutants not covered by the assay. To test this idea, we assessed how effectively PARROT prioritized a set of twelve Aß42 variants linked to familial Alzheimer's Disease (fAD) within the entire collection of single mutants. This analysis was analogous to what was performed by Seuma and colleagues in the original DMS study [8]. In addition to the predictions made by our residue-wise cross validation networks (PARROT_ResCV), we trained an additional network using PARROT on the entire DMS dataset minus the twelve fAD-linked single mutants and all double mutants containing one or both of these mutations (PARROT_nofAD). We calculated area under the ROC curve (AUROC) to evaluate the predictions of these PARROT networks, and compared PARROT's performance to the original DMS assay and to TANGO [41] and CADD [42], which are computational predictors of aggregation and variant effect respectively (Figure 4D). With the exception of the assay's scores (which PARROT trained on), PARROT_nofAD and PARROT_ResCV outperformed all other predictors. In particular, the success of the PARROT_nofAD predictor demonstrates that PARROT can effectively "fill in the gaps" of DMS experiments and help prioritize candidate variants for follow up study. Essentially, researchers can use PARROT to construct their own variant effect predictor that is specific to their assay and protein of interest.

DISCUSSION
When designing PARROT, we set out to develop a machine learning tool that effectively extracts patterns from protein sequence data, is generalizable to a wide array of regression and classification tasks, and is easy to use. There are a number of tools in recent years that satisfy some of these criteria, but not all three. For instance, deep learning-based predictors are becoming widely used in protein analysis, but these implementations tend to be designed for a single specific application rather than general use [22,23,43]. Although general protein analysis tools do exist, these typically implement simpler techniques like linear or logistic regression, support vector machines or decision trees, and are not necessarily able to identify complex, non-linear patterns in datasets [44,45]. Meanwhile, open source software packages like PyTorch, Keras, and TensorFlow make general deep learning frameworks freely available, but implementing these requires significant computational expertise and time investment. PARROT offers a freely available deep learning tool that satisfies all three of these criteria. By creating a tool that is sufficiently flexible, straight-forward, and computationally rigorous, we aim to make the advantages of deep learning accessible to all biological scientists. Importantly, we have demonstrated that predictors built using PARROT perform comparably to existing machine learning predictors across multiple contexts. In the case of phosphorylation site prediction, PHOSFER, PhosphoSVM, and MusiteDeep have all been specifically designed for this task, while PARROT was not. Nonetheless, PARROT still predicts phosphorylation sites approximately equivalently to each of these methods. Likewise, PARROT also performs comparably to both ADpred and PADDLE after training on the same dataset as either of these predictors. In our analysis of Aß42, we saw that PARROT networks trained on the DMS dataset were more effective at identifying pathogenic, fibril forming variants than computational tools like TANGO or CADD. Collectively, these results demonstrate that PARROT's flexibility across datasets does not come at the expense of performance.
The three specific applications we used to showcase PARROT outline broader use cases in which it can be effective. For starters, PARROT can be used to create predictors from existing bioinformatic datasets, like how we trained networks to predict phosphosites using P.ELM. variants that were not experimentally tested. In all three cases, PARROT can save researchers valuable time by eliminating the need to develop machine learning predictors de novo.
As a final point, we would like to emphasize to users of PARROT, or any similar tool, that predictions made by machine learning models should be interpreted with caution. Although deep learning methods are powerful at detecting patterns in data, this power also comes with increased susceptibility to overfitting and biased datasets. Proper data processing, not specific model architecture, is arguably the most critical factor for ensuring that deep learning is utilized accurately and meaningfully. While deep learning-based predictions can be instrumental in generating follow-up candidates and developing hypotheses, it is important to remember that these predictions do not replace the need for direct experimental validation.

LSTM implementation
PARROT's underlying bi-directional LSTM network is implemented using the PyTorch  Figure 1C). PARROT uses either a many-to-one or many-to-many architecture depending on whether the machine learning task at hand involves mapping protein sequences to single values (or class labels) or mapping each residue to a value/class label. The key implementation difference between these two architectures is in which hidden state vectors of the final layer of LSTM cells are input into the fully connected layer. For residue mapping, the hidden state vectors of the final forward and reverse cells at each position in the sequence are integrated into their own final connected layer (Figure 1C, gray). In contrast for sequence mapping, only the hidden state vectors from the final forward and final reverse cells are integrated into the fully connected layer (Figure 1C, green). For classification tasks, the fully-connected layer outputs a vector with a size corresponding to the number of class labels. For regression tasks, this layer outputs a single value.
During training, weights in PARROT networks are updated using the Adam optimizer [46]. By default, the initial learning rate is set at 0.001. Classification tasks employ a cross entropy loss function, while regression tasks use L1 and L2 loss functions for sequence mapping and residue mapping tasks respectively. PARROT splits input datasets 70-15-15 into training, validation, and testing datasets by default, however these proportions can be manually-specified via the "--set-fractions" argument. The validation set is not trained on, but used to assess network performance after each epoch of training. The test set is completely held out until after training has concluded, in order to give an estimate for how generalizable the trained network is on unseen data. Approximate training times for different hyperparameters and dataset sizes are listed in Supplemental Table 3. Further implementation details and information on additional run-time arguments can be found in the PARROT documentation.

Evaluation metrics
In binary classification problems, each prediction falls into one of four cases: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). We compared our PARROT networks to other predictors using a variety of performance metrics that describe distribution of predictions across each of these categories. These metrics are calculated in the following ways: Alternatively, performance on binary classification tasks can be evaluated using precision-recall or receiver operator characteristic (ROC) curves. Instead of assigning each predicted sequence a discrete class label, sequences are assigned a continuous real number value corresponding to the confidence that it belongs to a particular class. We generated these non-discrete predictions using the optional "--probabilistic-classification" command-line argument, and calculated AUPRC and AUROC using the Python package scikit-learn [47].

Phosphosite prediction
The same P.ELM and PPA datasets were used as by Dou et al. [33], each split into separate phospho-serine, -threonine, and -tyrosine subsets. Initially, sequences with >30% similarity within each subset were removed using CD-HIT with default arguments [34]. We next extracted all 19-residue windows centered around all serine, threonine and tyrosine residues in each of the respective datasets, dividing these into phosphorylation-positive and phosphorylation negative sets. A subsequent round of filtering was performed and sequences within these subsets with >20% similarity were removed. We then randomly downsampled the phosphorylation-negative sequences so that their number equaled the phosphorylation-positives, and merged the two datasets into a single file for training by PARROT.
Our analysis proceeded by training and evaluating the networks on the P.ELM dataset using 10-fold cross-validation. The pSer, pThr and pTyr datasets were each split randomly into ten equal subsets. Using the "--split" argument, PARROT networks were subsequently trained on nine of these sets and the resulting network made predictions for the sequences in the held out test set. These networks were trained using the following arguments: two hidden layers; hidden vector size of ten; learning rate of 0.0001; batch size of 64; 500 training epochs. The reported performance metrics in Figure 2 and Supplemental Tables 1 & 2 denote the average scores across the 10 cross-validation test sets. Predictions were also made by PHOSFER and MusiteDeep through their online web server on each of the cross-validation test sets and performance metrics were averaged. However, we opted not to test PhosphoSVM in this manner since this predictor was originally trained on the same P.ELM data and we wanted to avoid overfitting. Instead, we report the performance metrics taken directly from Dou et al., since these were calculated using a similar strategy of 10-fold cross-validation on the P.ELM dataset [33].
Using the same training arguments, additional networks were trained on the full P.ELM dataset (separately for pSer, pThr and pTyr) and used to make predictions on the PPA dataset.
Predictions were also made by PHOSFER, MusiteDeep on the same PPA data, and performance metrics were calculated for each of these sets of predictions. As with the P.ELM data, the performance metrics of PhosphoSVM on the PPA data were taken directly from Dou et al.

Activation domain function prediction
The quantitative fluorescence assay data of Erijman et al. was collected and processed in a manner identical to its source paper [5]. Briefly, each 30-mer was assigned a real number score based on its distribution of reads across four fluorescence expression bins. These sequences were split into AD-positive and AD-negative sets and the negative set was sampled such that there were equal numbers of positive and negative sequences in the final dataset. This sampling process was repeated five times for the "full" dataset (75,846 sequences), as well as for each of the reduced datasets (70K sequences, 60K sequences, etc.) in order to generate additional replicates.
Each dataset was split randomly into 10 cross-validation subsets, and PARROT networks were subsequently trained on 9 and tested on the held-out subset. PARROT networks were trained using the following hyperparameters: two hidden layers; hidden vector size of ten; learning rate of 0.0005; batch size of 64; 300 training epochs. Although our input data was set up as a classification task, by using the "--probabilistic-classification" argument, all of our predictions were output as real numbers between zero and one which allowed us to conduct precision-recall curve analysis. In addition to assessing the performance on the held-out test set, each network was also used to make predictions on an independent dataset. This independent dataset was obtained from a similar yeast AD quantitative fluorescence assay from Staller et al. [38]. We calculated the normalized expression value for each sequence in this dataset by dividing the raw AD activity (GFP) by the protein expression level (mCherry), and log-normalizing the data around the WT sequence. The performance metrics reported in Figure 3 are the averages of 50 total replicates (5 replicate datasets with 10-fold cross-validation for each).
We also created a combined training dataset using the results from a similar AD functional assay in Ravarani et al. [39]. We extracted all sequences from this assay that were at least five residues in length and split into positive and negative sets as described using a cut-off of -0.14. These AD-positive and negative sequences were then merged with the full Erijman et al. dataset, and PARROT networks were trained and evaluated in the same manner as before.
To perform comparisons against PADDLE [9], we extracted the activation assay data Predictions from the 42 test sets were combined, averaged (in the case of double mutants), and then analyzed.
We assessed the ability of PARROT to detect "epistasis" by comparing the network's prediction of double mutants to simpler approaches that estimated mutant effect by integrating nucleation scores of the associative single mutations. We determined statistical significance between correlations derived from these different approaches through bootstrapping. All data points were resampled with replacement 10,000 times, calculating Pearson's R for each iteration, and the 99% confidence intervals were used as a threshold for significance (p < 0.01).
The twelve fAD-linked variants that we analyzed were: H6R, D7N, D7H, E11K, K16N, A21G, E22G, E22K, E22Q, D23N, L34V and A42T. PARROT_ResCV and PARROT_nofAD predictions for all single mutants were ordered in order to create ROC curves. The CADD and TANGO predictions used for ROC analysis were also obtained from Seuma et al., as they performed an identical analysis on this set of 12 variants.

Implementation
The complete PARROT implementation consists of three command-line commands: parrot-train, parrot-predict and parrot-optimize. For the analysis described here, parrot-train was used to train the RNN predictors given a properly formatted dataset and parrot-predict was used to make predictions on new sequences using an existing trained network. We did not use parrot-optimize in these analyses, but can be used to automatically select network hyperparameters through Gaussian process optimization. More details can be found in the PARROT documentation: https://idptools-parrot.readthedocs.io/. PARROT is optimized to run in a Mac or Linux environment, but can also work using Windows.

Data availability
PARROT is fully open-source and available for download on GitHub (https://github.com/idptools/parrot) or through the PyPI library (idptools-parrot). Example networks for each of the analyses described here and code demonstrating their usage are provided on GitHub (https://github.com/holehouse-lab/supportingdata/).    [-5, 5] to randomly generated protein sequences~25-35 residues in length. Networks were trained using a NVIDIA TU116 GPU. Three replicate PARROT networks were trained for each specified set of hyperparameters and dataset.