AI-assisted structural consensus-proteome prediction of human monkeypox viruses isolated within a year after the 2022 multi-country outbreak

ABSTRACT The monkeypox virus (MPX) belongs to the Orthopoxvirus genus of the Poxviridae family, is endemic in parts of Africa, and causes a disease in humans similar to smallpox. The most recent outbreak of MPX already affects 110 countries, with 86,956 confirmed cases since May 2022, and has consequently become a focus of interest. In particular, a molecular understanding of the virus is essential to study infection processes and pathogen-host interactions, predict tropism changes, and guide drug discovery and development as well as vaccine development or adaptation at a very early stage. Herein, we present a study of the structural proteome of the currently circulating MPX: our consensus analysis of 3,713 genome sequences sampled within a year after the outbreak revealed 10,580 characteristic candidate open reading frames (ORFs). A search in the non-redundant protein database reduced the number of candidate ORFs to 1,079, of which 210 are representative proteins in typical MPX reference genomes. These should serve as a collection of putative proteins of the currently spreading MPX, a compilation of information that could support timely drug discovery, mutational analyses, and vaccine development. We present the most comprehensive structural proteome to date by providing atomistic 3D models of 210 proteins, generated with three state-of-the-art structure prediction methods, together with a mutational analysis of the proteome that focuses in particular on the drug-binding sites of tecovirimat and brincidofovir.

IMPORTANCE The 2022 outbreak of the monkeypox virus had, by April 2023, already involved 110 countries, with 86,956 confirmed cases and 119 deaths. Understanding an emerging disease on a molecular level is essential to study infection processes and eventually guide drug discovery at an early stage. To support this, we provide the most comprehensive structural proteome of the monkeypox virus to date, which includes 210 structural models, each computed with three state-of-the-art structure prediction methods. Instead of building on a single genome sequence, we generated our models from a consensus of 3,713 high-quality genome sequences sampled from patients within 1 year of the outbreak. We therefore present an average structural proteome of the currently isolated viruses, including mutational analyses with a special focus on drug-binding sites. Continued dynamic mutation monitoring within the structural proteome presented here is essential to predict possible physiological changes in the evolving virus in a timely manner.
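The consensus-then-ORF idea described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: it assumes the genomes are already aligned to equal length, takes a simple per-column majority vote, and scans only the forward strand for ATG-to-stop ORFs. The function names, the `min_codons` threshold, and the toy sequences are all hypothetical.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}

def consensus(aligned):
    """Majority-vote consensus over equal-length, pre-aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to equal length")
    # zip(*aligned) yields one tuple of characters per alignment column
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))

def find_orfs(seq, min_codons=3):
    """Return (start, end) spans of ATG...stop ORFs on the forward strand.

    min_codons counts the coding codons before the stop (assumed cutoff).
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end index includes the stop
                start = None
    return sorted(orfs)

# Toy, made-up "genomes"; the third carries a single substitution.
genomes = [
    "CCATGAAACCCGGGTAACC",
    "CCATGAAACCCGGGTAACC",
    "CCATGAAATCCGGGTAACC",
]
cons = consensus(genomes)
for start, end in find_orfs(cons):
    print(start, end, cons[start:end])
```

A real analysis would also scan the reverse complement and handle alignment gaps and ambiguous bases, which this sketch deliberately leaves out.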

Figure S1 | Structural models of ORF ID_5764 and potential binding sites. A) Sequence of ID_5764, which was subjected to structure prediction. B) Structure prediction of ID_5764 by ESMFold (1), shown as a cartoon colored in the PyMOL "rainbow" spectrum from blue (N-terminus) to red (C-terminus). The protein surface is shown in gray. A 3D point-cloud representing a potential binding pocket was calculated using the Catalophore™ platform (2) and is colored by the electrostatics of its surroundings (blue-white-red spectrum ranging from -1 to +1). C) Confidence values (pLDDT) of the AlphaFold2 (3) model along the sequence of ID_5764. Five models were built and ranked by their pLDDT (rank 1-5), of which the one with the highest overall pLDDT is depicted in D). D) Structure prediction using AlphaFold2, structural representation as in B). Three potential binding sites were detected, represented by 3D point-clouds colored by electrostatics.

Figure S2 | Number of distinct ORF variants detected within 7,023 viral genome sequences. The number of different ORF variants (equal to the number of unique mutations or mutation combinations) detected within all viral isolates is depicted on the y-axis, together with the lengths of the respective ORFs on the x-axis.

Figure S3 | Matching results of binding-site cavities calculated from homology models of all detected ID_6924 protein variants. The total matching score between tecovirimat-binding-site cavities calculated from the homology models of all protein variants is depicted in this matrix. The total score is built from matching scores of cavity properties such as electrostatics, aromaticity, and point-cloud shape. The respective mutations compared to the NCBI reference protein sequence in each protein variant are labeled on the left-hand side and the top (multiple mutations in a single sequence are separated by "_"), and "WT" refers to the NCBI reference protein sequence of phospholipase F13. The more similar two matched cavities are, the lower the total score; a score of 0 indicates identical cavities. Homology modeling, cavity-point-cloud creation, and matchings were performed within the Catalophore™ Drug Solver platform.

Figure S4 | Matching results of binding-site cavities calculated from homology models of all detected ID_8713 protein variants. The total matching score between brincidofovir-binding-site cavities calculated from the homology models of all protein variants is depicted in this matrix. The total score is built from matching scores of cavity properties such as electrostatics, aromaticity, and point-cloud shape. The respective mutations compared to the NCBI reference protein sequence in each protein variant are labeled on the left-hand side and the top (multiple mutations in a single sequence are separated by "_"), and "WT" refers to the NCBI reference protein sequence of DNA polymerase E9. The more similar two matched cavities are, the lower the total score; a score of 0 indicates identical cavities. Homology modeling, cavity-point-cloud creation, and matchings were performed within the Catalophore™ Drug Solver platform.

Figure S5 | Structural representation of the differences in electrostatics in the binding site of protein variants compared to the respective consensus ORF. The most different cavities compared to the cavity of the consensus protein variant were determined (Figures S3 and S4) and the respective difference-cavities are depicted here. Difference-point-clouds show the difference in electrostatics between the consensus variant and the variant on the right-hand side, colored from blue (-0.25) to white (0, no difference) to red (+0.25). The respective binding-site cavities used for the matching are colored by the electrostatics of their surroundings (blue-white-red spectrum ranging from -1 to +1). The respective mutated residues of each variant are shown as yellow sticks. Residues within 5 Å of the binding-site cavities are shown as black lines.
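As a rough illustration of how a difference-point-cloud with the blue-white-red scale of Figure S5 could be derived, the sketch below subtracts per-point electrostatics values and maps the result to a color. It rests on assumptions not stated in the caption: that both cavities are sampled on a shared grid so points can be matched by coordinate, and that the difference is taken as variant minus consensus. The function names and data layout are hypothetical and do not reflect the Catalophore™ platform's internals.

```python
def difference_cloud(consensus_cloud, variant_cloud):
    """Per-point difference (variant minus consensus) on shared grid points.

    Both inputs map an (x, y, z) grid coordinate to a property value.
    Points present in only one cloud are skipped (an assumption).
    """
    shared = consensus_cloud.keys() & variant_cloud.keys()
    return {p: variant_cloud[p] - consensus_cloud[p] for p in shared}

def diff_to_rgb(delta, limit=0.25):
    """Map a difference to blue (-limit), white (0), or red (+limit)."""
    t = max(-limit, min(limit, delta)) / limit   # clamp, then scale to [-1, 1]
    if t < 0:
        return (1.0 + t, 1.0 + t, 1.0)           # fade from white toward blue
    return (1.0, 1.0 - t, 1.0 - t)               # fade from white toward red

# Made-up values for two tiny cavities.
consensus_cloud = {(0, 0, 0): 0.5, (1, 0, 0): -0.25}
variant_cloud = {(0, 0, 0): 0.75, (1, 0, 0): -0.25, (2, 0, 0): 1.0}

for point, delta in sorted(difference_cloud(consensus_cloud, variant_cloud).items()):
    print(point, delta, diff_to_rgb(delta))
```

Values beyond the limit are clamped, so any difference larger than +0.25 is drawn in the same saturated red as +0.25 itself.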

Figure S6 | Structural representation of the differences in hydrophobicity in the binding site of protein variants compared to the respective consensus ORF. The most different cavities compared to the cavity of the consensus protein variant were determined (Figures S3 and S4) and the respective difference-cavities are depicted here, showing the changes in hydrophobicity between the consensus variant and the variant on the right-hand side, colored from blue (-0.05) to white (0, no difference) to red (+0.05). The respective binding-site cavities used for the matching are colored by the hydrophobicity of their surroundings (blue-white-red spectrum ranging from -0.25 to +0.25). The respective mutated residues of each variant are shown as yellow sticks. Residues within 5 Å of the binding-site cavities are shown as black lines. Leucine at position 118 in ID_6924 is highlighted, as structural changes of this residue are responsible for the greatest difference in hydrophobicity.
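Selecting the "most different" cavity from a matching-score matrix such as those in Figures S3 and S4 amounts to finding the largest score in the reference row, since lower scores mean more similar cavities and 0 means identical. The following sketch shows that lookup together with the "_"-separated label convention. The labels and scores are invented placeholders, and the helper names are hypothetical.

```python
def parse_label(label):
    """Split a variant label into its mutations; "WT" carries none."""
    return [] if label == "WT" else label.split("_")

def most_different(labels, scores, reference="WT"):
    """Return (label, score) of the variant least similar to the reference.

    scores is a square matrix ordered like labels; higher means less similar.
    """
    ref = labels.index(reference)
    candidates = [
        (scores[ref][j], labels[j]) for j in range(len(labels)) if j != ref
    ]
    score, label = max(candidates)
    return label, score

# Placeholder labels and scores, not taken from the study.
labels = ["WT", "A10V", "A10V_G20D"]
scores = [
    [0.0, 1.5, 4.0],
    [1.5, 0.0, 2.5],
    [4.0, 2.5, 0.0],
]

label, score = most_different(labels, scores)
print(label, score, parse_label(label))
```

Passing a different `reference` label would rank variants against the consensus variant instead of the NCBI reference sequence, provided that label appears in the matrix.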