Holographic-(V)AE: An end-to-end SO(3)-equivariant (variational) autoencoder in Fourier space

Open Access

Gian Marco Visani, Michael N. Pun, Arman Angaji, and Armita Nourmohammad

Phys. Rev. Research 6, 023006 – Published 1 April 2024

Abstract

Group-equivariant neural networks have emerged as an efficient approach to model complex data, using generalized convolutions that respect the relevant symmetries of a system. These techniques have made advances in both the supervised learning tasks for classification and regression, and the unsupervised tasks to generate new data. However, little work has been done in leveraging the symmetry-aware expressive representations that could be extracted from these approaches. Here, we present holographic-(variational) autoencoder [H-(V)AE], a fully end-to-end SO(3)-equivariant (variational) autoencoder in Fourier space, suitable for unsupervised learning and generation of data distributed around a specified origin in 3D. H-(V)AE is trained to reconstruct the spherical Fourier encoding of data, learning in the process a low-dimensional representation of the data (i.e., a latent space) with a maximally informative rotationally invariant embedding alongside an equivariant frame describing the orientation of the data. We extensively test the performance of H-(V)AE on diverse datasets. We show that the learned latent space efficiently encodes the categorical features of spherical images. Moreover, the low-dimensional representations learned by H-VAE can be used for downstream data-scarce tasks. Specifically, we show that H-(V)AE's latent space can be used to extract compact embeddings for protein structure microenvironments, and when paired with a random forest regressor, it enables state-of-the-art predictions of protein-ligand binding affinity.

Received 11 June 2023
Accepted 22 February 2024

DOI:https://doi.org/10.1103/PhysRevResearch.6.023006

Published by the American Physical Society under the terms of the Creative Commons Attribution 4.0 International license. Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI.

Physics Subject Headings (PhySH)

Bioinformatics; Biological neural networks; Inverse problems

Interdisciplinary Physics; Statistical Physics & Thermodynamics; Physics of Living Systems

Authors & Affiliations

Gian Marco Visani ^*

Paul G. Allen School of Computer Science and Engineering, University of Washington, 85 E Stevens Way NE, Seattle, Washington 98195, USA

Michael N. Pun

Department of Physics, University of Washington, 3910 15th Avenue Northeast, Seattle, Washington 98195, USA

Arman Angaji

Institute for Biological Physics, University of Cologne, Zülpicher Str. 77, 50937 Cologne, Germany

Armita Nourmohammad ^†

Department of Physics, University of Washington, 3910 15th Avenue Northeast, Seattle, Washington 98195, USA; Paul G. Allen School of Computer Science and Engineering, University of Washington, 85 E Stevens Way NE, Seattle, Washington 98195, USA; Department of Applied Mathematics, University of Washington, 4182 W Stevens Way NE, Seattle, Washington 98105, USA; and Fred Hutchinson Cancer Center, 1241 Eastlake Ave E, Seattle, Washington 98102, USA

^*Correspondence address: gvisan01@cs.washington.edu
^†Correspondence address: armita@uw.com

Issue

Vol. 6, Iss. 2 — April - June 2024

Images

Figure 1
Schematic of the network architecture. (a) Schematic of a steerable tensor with $\ell_{\max} = 1$ and 4 channels per feature degree. We choose a pyramidal representation that naturally follows the expansion in size of features of higher degree. (b) Schematic of a Clebsch-Gordan block (CG bl.), with batch norm (BN), efficient tensor product (ETP), signal norm (SN), and linear (Lin) operations. (c) Schematic of the H-AE architecture. We color code features of different degrees in the input and in the latent space for clarity. The H-VAE schematic differs only in the latent space, where two sets of invariants are learned (means and standard deviations of an isotropic Gaussian distribution).
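The pyramidal layout in panel (a) can be pictured with a small sketch. It assumes a steerable tensor is stored as one block per degree $\ell$, each of shape (channels, $2\ell + 1$); the dictionary layout and the name `make_steerable_tensor` are illustrative choices, not the authors' data structure.

```python
import numpy as np

def make_steerable_tensor(l_max, channels, rng):
    """Return {degree l: complex array of shape (channels, 2l + 1)}.

    Hypothetical container: each degree l carries 2l + 1 components per
    channel, so the blocks widen with l (the "pyramid" in the schematic).
    """
    tensor = {}
    for l in range(l_max + 1):
        shape = (channels, 2 * l + 1)
        tensor[l] = rng.normal(size=shape) + 1j * rng.normal(size=shape)
    return tensor

rng = np.random.default_rng(0)
tensor = make_steerable_tensor(l_max=1, channels=4, rng=rng)

# Block shapes per degree and the total number of coefficients.
shapes = {l: block.shape for l, block in tensor.items()}
n_coeffs = sum(block.size for block in tensor.values())
print(shapes, n_coeffs)
```

With $\ell_{\max} = 1$ and 4 channels, this gives a 4 x 1 block for $\ell = 0$ and a 4 x 3 block for $\ell = 1$, for 16 coefficients in total.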
Figure 2
H-VAE on MNIST-on-the-sphere. Evaluation on rotated digits for an H-VAE trained on nonrotated digits with $z = 16$. (a) Original and reconstructed images in the canonical frame after inverse transform from Fourier space. The images are projected onto a plane. Distortions at the edges and flipping are side effects of the projection. (b) Visualization of the latent space via 2D UMAP [27]. Data points are colored by digit identity. (c) Cherry-picked images generated by feeding the decoder invariant embeddings sampled from the prior distribution and the canonical frame. (d) Example image trajectory obtained by linearly interpolating through the learned invariant latent space. Interpolated invariant embeddings are fed to the decoder alongside the canonical frame. The MNIST-on-the-sphere dataset is created by projecting data from planar MNIST onto a discrete unit sphere, using the Driscoll-Healy (DH) method with a bandwidth (bw) of 30 [4].
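To make the discrete sphere concrete, the sketch below builds a DH-style sampling grid for a bandwidth of 30. It assumes one common convention (2 x bw equiangular samples in each of colatitude and longitude, with colatitudes offset from the poles); the exact convention and the projection of the planar digits used for the dataset are not specified here, and the function names are illustrative.

```python
import numpy as np

def dh_grid(bandwidth):
    """Equiangular grid with 2*bandwidth samples per angle (assumed convention)."""
    b = bandwidth
    j = np.arange(2 * b)
    theta = np.pi * (2 * j + 1) / (4 * b)  # colatitude, strictly inside (0, pi)
    phi = 2 * np.pi * j / (2 * b)          # longitude in [0, 2*pi)
    return np.meshgrid(theta, phi, indexing="ij")

def to_unit_vectors(theta, phi):
    """Convert spherical angles to Cartesian points on the unit sphere."""
    return np.stack(
        [np.sin(theta) * np.cos(phi),
         np.sin(theta) * np.sin(phi),
         np.cos(theta)],
        axis=-1,
    )

theta, phi = dh_grid(30)
points = to_unit_vectors(theta, phi)
print(points.shape)
```

Under this convention a bandwidth of 30 yields a 60 x 60 grid, so each spherical image has 3600 sample points.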
Figure 3
Visual proof of the disentanglement in the latent space of MNIST-on-the-sphere. For each row, the invariant embedding $z$ is held fixed, and a different frame (i.e., rotation matrix) is used in each column. Frames are sampled randomly and differ across rows, with the exception of the first column, which is always the identity frame. Then, $z$ and the frame are fed to the decoder, and the inverse Fourier transform is used to generate the reconstructed spherical image, which is projected onto a plane for ease of visualization. Modulo the distortions introduced by projecting the image onto a plane, it is clear that the invariant embedding contains all semantic information, and the frame solely determines the orientation of the image.
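The role of the frame can be illustrated with a toy example in real space. The sketch assumes the frame is a 3 x 3 rotation matrix applied to points in a canonical orientation; the actual decoder operates on Fourier coefficients, which rotate via Wigner D-matrices rather than by direct matrix multiplication. The array `canonical` and the helper `random_rotation` are hypothetical stand-ins, not part of the published model.

```python
import numpy as np

def random_rotation(rng):
    """Sample a random proper rotation (det = +1) via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # fix the sign ambiguity of each column
    if np.linalg.det(q) < 0:      # turn a reflection into a rotation
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(0)

# Stand-in for content decoded in the canonical frame (5 points in 3D).
canonical = rng.normal(size=(5, 3))

# First frame is the identity, the rest are random, as in one figure row.
frames = [np.eye(3)] + [random_rotation(rng) for _ in range(3)]
oriented = [canonical @ R.T for R in frames]

# Rotation-invariant quantities (here, point norms) are unchanged by the frame.
norms = [np.linalg.norm(x, axis=1) for x in oriented]
```

The identity frame returns the canonical content unchanged, while every other frame changes only its orientation; quantities such as the point norms stay fixed across frames.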
Figure 4
H-(V)AE implicitly learns to maximally overlap training images on MNIST-on-the-sphere. For each of the four models with $z = 16$, we train a version using only images containing 1s and 7s. For each of the resulting eight models, we visualize the sum of training images of digits 1 and 7, when rotated to the canonical frame. We compute the sums of images with the same digit, and overlay them with different colors for ease of visualization. We test the hypothesis that H-(V)AE learns frames that align the training images such that they maximally overlap; we do so in two ways. First, if the hypothesis were true, all canonical images of the same digit should maximally or near-maximally overlap, since they have very similar shapes, and thus their overlays would look like a “smooth” version of that digit. Indeed, we find this statement to be true for all models irrespective of their training strategy. Second, we consider the alignment of images of different digits. We take 1s and 7s as examples given their similarity in shape. If the hypothesis were true, models trained with only 1s and 7s should align canonical 1s along the long side of canonical 7s; indeed, we find this to be the case for the variational models, for which the embeddings are believed to be more semantically meaningful and more robust to noise. The same alignment between 1s and 7s, however, does not necessarily hold for models trained with all digits. This is because maximizing overlap across a set of diverse shapes does not necessarily maximize the overlap within any independent pair of such shapes. Indeed, we find that canonical 1s and canonical 7s do not overlap optimally with each other for models trained with all digits. We note that these tests do not provide a formal proof, but rather empirical evidence of the characteristics of frames learned by H-(V)AE on the MNIST-on-the-sphere task.
Figure 5
Structural embeddings to predict protein-ligand binding affinities with H-(V)AE. (a) H-VAE was trained to reconstruct the Fourier representation of 3D atomic point clouds representing amino acids (colors). The invariant latent space clusters by amino acid conformations. The highlighted clusters for PHE and TYR contain residue pairs with similar conformations; TYR and PHE differ by one oxygen at the end of their benzene rings. We compare conformations by plotting each residue in the standard backbone frame (right); the $x$ and $y$ axes are set by the orthonormalized $C_\alpha$-N and $C_\alpha$-C vectors, and the $z$ axis is their cross product. For this plot, 1000 amino acids were used as training data, with network parameters $β = 0.025$ and $z = 2$. (b) (Left) An example protein neighborhood (point cloud of atoms) of 10 Å around a central residue, used to train the H-AE models, is shown. (Right) 2D UMAP visualization of the 128-dimensional invariant latent space learned by H-AE trained on the protein structure neighborhoods with $L = 6$; the latent space separates neighborhoods by the secondary structure of their focal amino acid (colors). A linear classifier trained on 300 000 latent embeddings predicts the secondary structure of the focal amino acid with 90% accuracy; see Figs. S18 and S19 within the SM [28] for a more detailed analysis of this latent space. Each point represents a neighborhood; see Sec. pp1-s5e for details on the network architecture and training procedure. (c) We use H-AE to extract the residue-level SO(3)-invariant embeddings in the binding pocket of a protein-ligand structure complex (data from PDBbind [36]). We then sum over these embeddings to form an SE(3)-invariant pocket embedding that is used as an input to a standard machine learning model to predict the binding affinity between the protein and the ligand. (d) The predictions of the protein-ligand binding affinities from (c) are shown against the true values for the training (left) and the test (right) sets. We use the data split provided by ATOM3D [37], which devises training and test sets containing 3507 and 490 protein-ligand complexes, respectively, with a maximum of 30% sequence similarity between training and test proteins; see Table 2 for a comparison against state-of-the-art methods.
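The backbone frame described in panel (a) can be sketched as a Gram-Schmidt construction. The sketch assumes the $C_\alpha$-N direction is kept as the $x$ axis and the $C_\alpha$-C vector is orthogonalized against it to give $y$; this ordering, the function names, and the example coordinates are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def backbone_frame(ca, n, c):
    """Return a 3x3 matrix whose rows are the x, y, z axes of the frame."""
    x = n - ca
    x = x / np.linalg.norm(x)        # x axis: unit C-alpha -> N direction
    v = c - ca
    y = v - np.dot(v, x) * x         # remove the component of C-alpha -> C along x
    y = y / np.linalg.norm(y)
    z = np.cross(x, y)               # z axis completes a right-handed frame
    return np.stack([x, y, z])

def to_frame(coords, ca, frame):
    """Express coordinates relative to C-alpha in the backbone frame."""
    return (np.asarray(coords) - ca) @ frame.T

# Hypothetical atom positions in angstroms, for illustration only.
ca = np.array([1.0, 2.0, 3.0])
n = np.array([2.2, 2.5, 3.1])
c = np.array([0.6, 3.3, 2.4])

frame = backbone_frame(ca, n, c)
local_n = to_frame(n, ca, frame)
local_c = to_frame(c, ca, frame)
```

In this frame the N atom lies on the $x$ axis and the C atom lies in the $x$-$y$ plane, so residues with different global orientations can be compared directly.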

Physical Review Research