Light attention predicts protein location from the language of life

Abstract

Summary: Although knowing where a protein functions in a cell is important to characterize biological processes, this information remains unavailable for most known proteins. Machine learning narrows the gap through predictions from expert-designed input features leveraging information from multiple sequence alignments (MSAs), which are resource-expensive to generate. Here, we showcased using embeddings from protein language models for competitive localization prediction without MSAs. Our lightweight deep neural network architecture used a softmax-weighted aggregation mechanism with linear complexity in sequence length, referred to as light attention. The method significantly outperformed the state-of-the-art (SOTA) for 10 localization classes by about 8 percentage points (Q10). So far, this might be the highest improvement of just embeddings over MSAs. Our new test set highlighted the limits of standard static datasets: while inviting new models, they might not suffice to claim improvements over the SOTA.

Availability and implementation: The novel models are available as a web-service at http://embed.protein.properties. Code needed to reproduce results is provided at https://github.com/HannesStark/protein-localization. Predictions for the human proteome are available at https://zenodo.org/record/5047020.

Supplementary information: Supplementary data are available at Bioinformatics Advances online.


Protein Preliminaries
Protein Sequences. Proteins are built by chaining an arbitrary number of the 20 standard amino acids in a particular order. When amino acids come together to form protein sequences, they are dubbed residues. During assembly in the cell, constrained by physicochemical forces, the one-dimensional chains of residues fold into unique 3D shapes, determined solely by their sequence, that largely determine protein function. The ideal machine learning model would predict a protein's 3D shape, and thus its function, from the protein sequence alone (the ordered chain of residues).
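As a minimal illustration of this representation, the sketch below encodes a protein sequence over the 20-letter amino-acid alphabet as integer indices, a typical first step before feeding sequences to a machine learning model. The alphabet ordering and the example sequence are illustrative choices, not taken from the paper.

```python
# Minimal sketch: encode a protein sequence (a string over the 20 standard
# amino acids) as integer indices for use as machine learning input.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode(sequence: str) -> list[int]:
    """Map each residue to an integer in [0, 19]."""
    return [AA_TO_INDEX[aa] for aa in sequence]

print(encode("MKTAYIAK"))  # [10, 8, 16, 0, 19, 7, 0, 8]
```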
Protein Subcellular Location. Eukaryotic cells contain different organelles/compartments. Each organelle serves a purpose, e.g., ribosomes chain together new proteins while mitochondria synthesize ATP. Proteins are the machinery used to perform these functions, including transport in and out of, and communication between, different organelles and a cell's environment. For some compartments, e.g., the nucleus, special stretches of amino acids, e.g., nuclear localization signals (NLS), help identify a protein's location via simple string matching. However, for many others, the localization signal is diluted within the whole sequence, requiring sequence-level predictions. Furthermore, some organelles (and the cell itself) feature membranes with different biochemical properties than the inside or outside, requiring protein gateways.
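To make "simple string matching" concrete, the sketch below scans a sequence for a classical monopartite NLS-like motif with a regular expression. The pattern K(K/R)X(K/R) is a simplified consensus used purely for illustration, not the detection rule of any particular tool.

```python
import re

# Minimal sketch: flag putative nuclear localization signals (NLS) by
# string matching. K(K/R)X(K/R) is a simplified classical monopartite
# NLS consensus, used here only for illustration.
NLS_PATTERN = re.compile(r"K[KR].[KR]")

def find_nls(sequence: str) -> list[tuple[int, str]]:
    """Return (start position, matched substring) for each NLS-like hit."""
    return [(m.start(), m.group()) for m in NLS_PATTERN.finditer(sequence)]

# The SV40 large T antigen NLS (PKKKRKV) contains such a basic stretch.
print(find_nls("MAPKKKRKVEDP"))  # [(3, 'KKKR')]
```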
Homology-inference. Two highly similar protein sequences will most likely fold into similar 3D structures and are more likely to perform similar functions. Homology-based inference (Nair & Rost, 2002; Mahlich et al., 2018), which transfers annotations of experimentally validated proteins to query protein sequences, is based on this assumption (Sander & Schneider, 1991). Practically, this means searching a database of annotated protein sequences for sequences that meet both an identity threshold and a length-of-match threshold with respect to the query protein sequence. Sequence homology delivers good results, but its stringent requirements render it applicable to only a fraction of proteins (Rost, 1999).
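A rough sketch of this transfer step, assuming search hits are already available in tabular form (percent identity and alignment length per hit, as produced by common search tools); the thresholds and hit records below are illustrative, not the paper's settings.

```python
from dataclasses import dataclass

# Minimal sketch of annotation transfer by homology, assuming a prior
# database search produced hits with identity and alignment length.
@dataclass
class Hit:
    subject_localization: str  # annotation of the database protein
    percent_identity: float    # pairwise sequence identity of the match
    alignment_length: int      # length of the aligned region

def transfer_annotation(hits: list[Hit],
                        min_identity: float = 40.0,   # illustrative threshold
                        min_length: int = 100) -> str | None:
    """Transfer the annotation of the best hit passing both thresholds."""
    accepted = [h for h in hits
                if h.percent_identity >= min_identity
                and h.alignment_length >= min_length]
    if not accepted:
        return None  # no homolog close enough: fall back to prediction
    best = max(accepted, key=lambda h: h.percent_identity)
    return best.subject_localization

hits = [Hit("Nucleus", 92.5, 310), Hit("Cytoplasm", 35.0, 280)]
print(transfer_annotation(hits))  # Nucleus
```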
Machine Learning Function Prediction. When moving into territory where sequence similarity is less conserved or stretches of matching sequences are shorter (Mahlich et al., 2018; Rost, 2002), one can try predicting function using evolutionary information and machine learning (Goldberg et al., 2012; Almagro Armenteros et al., 2017). Evolutionary information from protein profiles, encoding a protein's evolutionary path, is obtained by aligning sequences from a protein database to a query protein sequence and computing conservation metrics at the residue level. Using profiles leads to substantially more accurate predictions for sequences with no close homologs and has been the standard for most protein prediction tasks (Urban et al., 2020), including subcellular localization (Goldberg et al., 2012; Almagro Armenteros et al., 2017; Savojardo et al., 2018). While profiles provide a strong and useful inductive bias, their information content heavily depends on a balance of the number of similar proteins (depth), the overall length of the matches (sequence coverage), and the diversity of the matches (column coverage), and their generation is parameter-sensitive.
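As a rough illustration of "computing conservation metrics at the residue level", the sketch below derives a per-column amino-acid frequency profile from a toy MSA. Real profile generation (e.g., with PSI-BLAST or HHblits) adds sequence weighting, pseudocounts and iterated database searches, all omitted here.

```python
from collections import Counter

# Rough sketch: per-column amino-acid frequencies of a toy MSA.
def profile(msa: list[str]) -> list[dict[str, float]]:
    """Relative amino-acid frequencies for each alignment column."""
    result = []
    for column in zip(*msa):  # transpose: iterate column by column
        residues = [aa for aa in column if aa != "-"]  # ignore gap characters
        counts = Counter(residues)
        total = sum(counts.values())
        result.append({aa: n / total for aa, n in counts.items()} if total else {})
    return result

toy_msa = ["MKT-Y", "MRTAY", "MKSAY"]
print(profile(toy_msa)[1])  # {'K': 0.667, 'R': 0.333} (approximately)
```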

Hyperparameters
The following describes the search space used to find hyperparameters of our final LA and FNN models. We performed a random search over these parameters. The evaluated learning rates were in the range [5 × 10⁻⁶, 5 × 10⁻³]. For the light attention architecture, we tried filter sizes {3, 5, 7, 9, 11, 13, 15, 21} and hidden sizes d_out ∈ {32, 128, 256, 512, 1024, 1500, 2048}, as well as concatenating outputs of convolutions with different filter sizes. For the FNN, we searched over hidden layer sizes {16, 32, 64, 512, 1024}, of which 32 was the optimum. We maximized the batch size to fit a Quadro RTX 8000 with 48 GB of VRAM, resulting in a batch size of 150. Note that the memory requirement depends on the length of the longest sequence in a batch. In the DeepLoc dataset, the longest sequence had 13 100 residues.
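A minimal sketch of such a random search loop. The sampling ranges mirror the ones above; train_and_evaluate is a dummy stand-in for actual model training, and the number of trials is an illustrative choice.

```python
import math
import random

# Minimal sketch of the random hyperparameter search described above.
def train_and_evaluate(config: dict) -> float:
    """Placeholder: would train a model and return its validation score."""
    return random.random()

def sample_config() -> dict:
    return {
        # learning rate: log-uniform over [5e-6, 5e-3]
        "learning_rate": 10 ** random.uniform(math.log10(5e-6), math.log10(5e-3)),
        "filter_size": random.choice([3, 5, 7, 9, 11, 13, 15, 21]),
        "d_out": random.choice([32, 128, 256, 512, 1024, 1500, 2048]),
    }

best_score, best_config = float("-inf"), None
for _ in range(50):  # number of trials is an illustrative choice
    config = sample_config()
    score = train_and_evaluate(config)
    if score > best_score:
        best_score, best_config = score, config
print(best_config)
```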

Additional Results
We provide results for both setDeepLoc (Table 4) and setHARD (Table 3) in tabular form, including the Matthews correlation coefficient (MCC) and the class-unweighted (macro) F1 score.
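The sketch below shows how these two metrics can be computed with scikit-learn, with "class-unweighted" corresponding to macro averaging; the labels are illustrative.

```python
from sklearn.metrics import matthews_corrcoef, f1_score

# Minimal sketch: the two reported metrics computed with scikit-learn.
# average="macro" gives the class-unweighted F1; labels are illustrative.
y_true = ["Nucleus", "Cytoplasm", "Nucleus", "Plastid"]
y_pred = ["Nucleus", "Nucleus", "Nucleus", "Plastid"]

print("MCC:", matthews_corrcoef(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```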
Furthermore, in Figure 1 we find that the UMAP projections of x are more similar to those of the attention coefficients pooled along the length dimension, and show clear clusters.
Meanwhile, the projections of v_max in Figure 2 are less informative, even though the ablations showed that v_max is important for the performance of our architecture.
Notably, for both projections there are some clear outliers with the localization Plastid that are mapped far away from all other projections.
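For orientation, a condensed PyTorch sketch of the light attention aggregation producing the quantities referred to above: per-dimension attention coefficients softmaxed over the length dimension, their weighted sum x, and the max-pooled values v_max. Layer sizes are illustrative, and details of the published implementation (masking, dropout, the MLP head; see the GitHub repository) are simplified away.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Condensed sketch of light attention, simplified from the paper's
# description; sizes and the classifier head are illustrative.
class LightAttention(nn.Module):
    def __init__(self, embed_dim=1024, d_out=1024, kernel_size=9, n_classes=10):
        super().__init__()
        pad = kernel_size // 2
        self.values = nn.Conv1d(embed_dim, d_out, kernel_size, padding=pad)
        self.attention = nn.Conv1d(embed_dim, d_out, kernel_size, padding=pad)
        self.classifier = nn.Linear(2 * d_out, n_classes)

    def forward(self, emb):               # emb: (batch, embed_dim, length)
        v = self.values(emb)              # transformed values: (batch, d_out, length)
        e = self.attention(emb)           # raw attention scores
        alpha = F.softmax(e, dim=-1)      # attention coefficients over length
        x = (alpha * v).sum(dim=-1)       # softmax-weighted sum: (batch, d_out)
        v_max, _ = v.max(dim=-1)          # max-pooled values: (batch, d_out)
        return self.classifier(torch.cat([x, v_max], dim=-1))

model = LightAttention()
logits = model(torch.randn(2, 1024, 300))  # two toy "sequences" of length 300
print(logits.shape)                        # torch.Size([2, 10])
```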

Datasets
Since ESM-1b can only process sequences shorter than 1024 residues, we removed the longer ones. This resulted in 8662 sequences for the training data, 2457 for setDeepLoc, and 431 for setHARD. Table 5 shows the distribution of subcellular localization classes in the standard setDeepLoc and our new setHARD with all sequences included. In the following, we lay out the steps taken to produce the new test set (setHARD). The starting point is a filtered UniProt search with options as selected in Figure 4. The Python code used is available at data.bioembeddings.com/public/data/new_test_set_procedure_code_data.zip.
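A minimal sketch of this length filter, assuming sequences are held in a dict mapping identifiers to sequences (FASTA parsing omitted); the cap follows the ESM-1b limit stated above.

```python
# Minimal sketch of the length filter described above; FASTA parsing is
# omitted and the input is assumed to be a dict of identifier -> sequence.
MAX_LEN = 1024  # ESM-1b only processes sequences shorter than 1024 residues

def filter_by_length(sequences: dict[str, str]) -> dict[str, str]:
    return {sid: seq for sid, seq in sequences.items() if len(seq) < MAX_LEN}

toy = {"P1": "M" * 500, "P2": "M" * 2000}
print(list(filter_by_length(toy)))  # ['P1']
```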