Unsupervised modeling of the mutational landscape of adeno-associated virus viability

Adeno-associated virus 2 (AAV2) is a small virus capable of infecting humans and related organisms. It has recently emerged as a prominent candidate for gene therapy, primarily owing to its non-pathogenic nature in humans and the safety of its manipulation. The efficacy of AAV2 as a gene therapy vector hinges on its ability to infiltrate host cells, which in turn relies on assembling a capsid able to reach the nucleus of the target cell. To enhance infection potential, researchers have extensively screened combinatorial libraries obtained by introducing mutations into the capsid. The advent of high-throughput experimental techniques, such as deep mutational scanning (DMS), has made it feasible to experimentally assess the fitness of these libraries for their intended purpose. Notably, machine learning is starting to demonstrate its potential for predicting the mutational landscape from sequence data. In this context, we introduce a biophysically inspired model designed to predict the viability of genetic variants in DMS experiments, tailored to a specific segment of the CAP region of the AAV2 capsid protein. To evaluate its effectiveness, we train the model on diverse datasets, each probing different aspects of the mutational landscape shaped by the selection process. Our assessment of the biophysical model centers on two objectives: (i) providing quantitative predictions of the log-selectivity of variants and (ii) deploying it as a binary classifier to separate viable from non-viable sequences.

1. With $h_i(a)$ being the score of amino acid $a$ at site $i$, we define for each site $i$ the Boltzmann distribution $p_i(a) = e^{h_i(a)/T}/Z_i$, with $Z_i = \sum_a e^{h_i(a)/T}$, and sample from it one amino acid per site. The temperature parameter $T$ controls the "bias" induced by the single-site score (a sampling sketch follows this list).
2. For each site, an amino acid is sampled with uniform probability if its score exceeds some chosen threshold.
3. Create variants by randomly performing multiple single-site edits (mutations and insertions), accepting them if the sum of the scores of the individual edits exceeds some chosen threshold.
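As an illustration of the first sampling scheme, here is a minimal sketch in Python, assuming the single-site scores are given as a hypothetical $(L \times 20)$ array `h`; the per-site Boltzmann weights are simply a softmax of $h_i(\cdot)/T$:

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids

def boltzmann_sample(h, T=1.0, rng=None):
    """Draw one amino acid per site from p_i(a) = exp(h_i(a)/T) / Z_i.

    h : hypothetical (L, 20) array of single-site scores h_i(a)
    T : temperature controlling the bias induced by the scores
    """
    if rng is None:
        rng = np.random.default_rng()
    logits = h / T
    # per-site softmax; subtracting the max keeps the exponentials stable
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return "".join(AA[rng.choice(len(AA), p=row)] for row in p)

# e.g.: boltzmann_sample(np.random.randn(28, len(AA)), T=0.5)
```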
In addition to the score-designed variants, a collection of random variants was sampled at distances ranging from 2 to 10 from the wild-type sequence (tested in experiment-2).
One of their objectives was to investigate how the composition and size of the training set provided to the deep learning models affect their generative capabilities. For this purpose, they created different training sets based on the sampling methodologies described and the distances of the variants from the wild-type sequence. However, our focus was on exploring how the inclusion of a biophysical model could enhance the quality of inference compared to solely feeding sequences into a neural network and addressing a regression problem. Therefore, we consolidated all the sampled sequences into a single training set for each experiment.
After training their machine learning models, the researchers proceeded to generate new enhanced capsids using the following procedure. Initially, they sampled approximately two billion random sequences within a distance range of 5 to 25 from the wild-type sequence. Subsequently, these random sequences were scored and ranked using the trained model, and the top 1000 sequences at each distance were selected as seed sequences for the generation process. Furthermore, the top 100 sequences were subjected to experimental testing.

Starting from these selected seed sequences, a random set of 250 mutations/insertions was performed. These mutated sequences were then scored and ranked once again using the trained models, and the 50 sequences with the highest scores were chosen for the next iteration of mutations/insertions. This iterative process was repeated for a total of 20 rounds, and at each distance a fixed number of the best sequences was selected. The optimal sequences were selected within a distance range of 5 to 29.
All of these sequences, which were experimentally tested in experiment-3, were used by us to evaluate the generalization power of the machine learning model. It is worth noting that these seed and generated sequences differ significantly from the ones examined in experiments 1 and 2, as the latter were primarily sampled around the wild-type sequence.
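The iterative procedure just described can be summarized by a short generate-score-select loop. The sketch below is our reading of it, not the authors' exact pipeline; `mutate` and `score` are hypothetical callables standing for the random single-site editor and the trained model:

```python
import random

def iterative_design(seeds, mutate, score, n_rounds=20, n_edits=250, top_k=50):
    """Greedy generate-score-select loop (a sketch under stated assumptions).

    mutate : callable returning a randomly edited copy (mutation/insertion)
    score  : callable mapping a sequence to the trained model's score
    """
    pool = list(seeds)
    for _ in range(n_rounds):
        # apply a batch of random single-site edits to the current pool
        candidates = [mutate(random.choice(pool)) for _ in range(n_edits)]
        # rank by model score and keep the best for the next round
        pool = sorted(candidates, key=score, reverse=True)[:top_k]
    return pool
```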

A.2 Sequence encoding
To create diversity in the dataset, sequences have been mutated by substituting the residue present at one site with another one, or by inserting a new residue between two positions. No more than one residue is inserted between two sites. Consequently, the mutated sequences have variable lengths, ranging from a minimum of 28 (the length of the WT sequence when no insertion has been introduced) to a maximum of 57 (the case in which an insertion has been introduced at every available slot; see Fig. A1 for more details). To provide fixed-length input data to the neural network, we have chosen the following encoding for the sequences. We use a gap symbol to pad variants up to the maximum length achievable with the insertion pattern peculiar to this dataset (insertion-mutation-insertion-mutation-...-insertion), i.e. 57 as mentioned above. In principle, the side of the sequence on which to pad would be arbitrary. Still, since the sequences span two regions, one buried inside the more conserved capsid and another, more exposed, which is more variable, we place the padding on the more variable side, since the more conserved sites are more plausibly aligned.
In this choice we are not consistent with [16], which uses an encoding equivalent to the following scheme: for even positions, put the letter symbol present (the WT letter or a mutation); for odd positions, put a letter symbol if that insertion is performed and a gap otherwise. We have chosen not to use the latter because, for instance, the sequence of letters A − B C D would represent exactly the same sequence as A B C − D, yet the two would have different encodings and would therefore be treated as different by the model. The sequences are then fed to the network as one-hot arrays.
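A minimal sketch of this encoding, assuming a 21-letter alphabet (20 amino acids plus the gap used for padding); the `pad_left` flag and helper names are ours, chosen so the padding can be placed on the more variable side:

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"   # 20 amino acids plus the gap/padding symbol
AA_INDEX = {a: i for i, a in enumerate(ALPHABET)}
MAX_LEN = 57                         # 28 WT sites plus the possible insertion slots

def encode(seq, pad_left=True):
    """One-hot encode a variant, padding with gaps up to MAX_LEN.

    pad_left is a hypothetical flag: it should be set so that the padding
    falls on the more variable (exposed) side, keeping the conserved
    side aligned.
    """
    pad = "-" * (MAX_LEN - len(seq))
    padded = pad + seq if pad_left else seq + pad
    onehot = np.zeros((MAX_LEN, len(ALPHABET)), dtype=np.float32)
    for i, a in enumerate(padded):
        onehot[i, AA_INDEX[a]] = 1.0
    return onehot
```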

A.3 Architecture of the sequence to energy mapping neural network
Since the energy function of our biophysical model needs to be a single number, we add a further operation to the last layer: to maintain compatibility with the model used in [1], we take the difference between the outputs of the two perceptrons in the last layer, as sketched below.
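A sketch of this final operation in PyTorch; the hidden width and the layers upstream of this head are placeholders, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class EnergyHead(nn.Module):
    """Final operation of the sequence-to-energy network (a sketch):
    the last layer has two output units, and the scalar energy is
    their difference."""

    def __init__(self, hidden_dim=64):  # hidden_dim is a placeholder
        super().__init__()
        self.out = nn.Linear(hidden_dim, 2)  # two perceptrons in the last layer

    def forward(self, features):
        y = self.out(features)    # shape (batch, 2)
        return y[:, 0] - y[:, 1]  # one scalar energy per sequence

# e.g.: EnergyHead()(torch.randn(4, 64)) -> tensor of 4 energies
```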
For completeness, we report here the same analysis where, instead of using the distance from the WT as a metric, we use the average distance from the training sequences (see Fig. S3).
Fig. A1: Empirical distributions of the distance from the WT sequence for the three data sets experiment-1, experiment-2, and experiment-3.

Fig. S2: Schematic representation of the neural network used in this work. This structure is used both for the standard machine learning approach and for modeling the energy function of the biophysical model. See Sec. A.3 for details on the structure.

Fig. S3: Alternative robustness analysis with distance. The data set has been divided into slices according to their average distance from the training sequences, and the models have been tested on those slices. The score for the two models is reported as a function of the distance of the test data. (a) The models are trained on experiment-1 and tested on experiment-2. (b) The models are trained on experiments 1 and 2 and tested on experiment-3. (c) The model is trained on experiment-1 and tested on experiment-2. (d) The model is trained on experiments 1 and 2 and tested on experiment-3.