Equivariant Tensor Network Potentials

Machine-learning interatomic potentials (MLIPs) have contributed significantly to the recent progress in computational materials science and chemistry thanks to their ability to accurately approximate the energy landscapes of quantum-mechanical models while being orders of magnitude more computationally efficient. However, the computational cost and number of parameters of many state-of-the-art MLIPs grow exponentially with the number of atomic features. Tensor (non-neural) networks, based on low-rank representations of high-dimensional tensors, offer a way to reduce the number of parameters when approximating multidimensional functions; however, it is often not easy to encode model symmetries into them. In this work we develop a formalism for rank-efficient equivariant tensor networks (ETNs), i.e., tensor networks that remain invariant under actions of SO(3) upon contraction. All the key tensor network algorithms, such as orthogonalization of cores and DMRG-based algorithms, carry over to our equivariant case. Moreover, we show that many elements of modern neural network architectures, such as message passing, pooling, and attention mechanisms, can in some form be implemented in ETNs. Based on ETNs, we develop a new class of polynomial-based MLIPs that demonstrate superior performance over existing MLIPs for multicomponent systems.


Introduction

Motivation
Machine learning has reached a level of maturity that has spawned technology in many fields of science now accessible to non-experts. Prominent examples are DALL-E in computer vision, AlphaGo in board games, and ChatGPT in linguistics. In atomistic modeling, machine-learning models, or, in the field-specific language, machine-learning interatomic potentials (MLIPs), have been developed to approximate the energy landscape of an underlying quantum-mechanical model, and reductions in computational cost of several orders of magnitude have been reported while preserving the accuracy of quantum mechanics. The three major classes of MLIPs are neural-network-based potentials [1]-[7], polynomial potentials [8]-[10], and potentials based on Gaussian process regression [11]-[13]. It is believed, and the lack of successful counterexamples is good evidence for it, that rotational and permutational symmetries (i.e., with respect to permuting chemically equivalent atoms) must be incorporated into an MLIP's functional form. For neural networks this has traditionally been achieved by using invariant input features [1]-[3], [5], [6], which was later generalized to covariant features and equivariant neural-network feature transformations [7]. The state-of-the-art polynomial models also work with covariant features and their algebraic transformations (addition and multiplication) in an equivariant manner [8]-[10]. Gaussian process-based models use invariant kernels [11] or basis functions [12], [13]. A very interesting case is [14], which is both a neural network model and one whose output depends polynomially on the input features.
A recent benchmark study [15] compared the accuracy-vs-efficiency performance of different models on unary (i.e., single-component) systems evaluated on a CPU. In this study, at least in its small-dataset regime, a polynomial model, MTP [9], showed a comparatively favorable performance. Despite this and other success stories, polynomial-based models suffer from exponentially growing complexity when the number of features is large. In other words, neural-network potentials regain their advantage when applied to systems with complex quantum-mechanical phenomena (e.g., magnetism or electronic temperature) or a large chemical space (e.g., high-entropy alloys), since the latter models have a more controlled growth of the parameter space as the number of features increases [16]-[19].
The problem of a large parameter space can be described as follows. Polynomial MLIPs are potentials whose functional form is constructed by taking polynomials of atomic features (positions, atomic species, magnetic moments, etc.). Polynomial MLIPs can be encoded with tensors, i.e., their functional form for a d-th order polynomial usually consists of (linear) combinations of scalar-valued tensor contractions of the following type (assuming the Einstein summation convention):

α = T_{i_1...i_d} v_{i_1} ⋯ v_{i_d},    (1)

where T_{i_1...i_d} is the polynomial coefficient (parameter) tensor and v_{i_1}, ..., v_{i_d} are the feature vectors. The feature vectors themselves are multi-index vectors v_i = v_{(j_1...j_n)} = F_{j_1...j_n}, where F is the feature tensor in which each dimension represents one atomic feature. Evidently, the representation (1) becomes increasingly inefficient with an increasing number of features n, because the size of the feature vector is proportional to m^n, where m is the average size over all dimensions of the feature tensor, and, therefore, contracting T d times with v scales as m^{dn}. Even with only radial, angular, and species features, the number of parameters can quickly increase to the order of one thousand for alloys with a large number of components, such as high-entropy alloys.
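As a concrete illustration of the scaling of the "raw" representation (1), the following sketch (with hypothetical sizes m, n, d; not the paper's implementation) flattens a feature tensor F into a feature vector v of length m^n and contracts a dense coefficient tensor T with d copies of v:

```python
import numpy as np

# Sketch of the raw polynomial contraction (1): alpha = T_{i1...id} v_{i1}...v_{id}.
# Sizes m, n, d are hypothetical; the point is the m**n and (m**n)**d scaling.
m, n, d = 3, 4, 3                       # m^n = 81 features
F = np.random.rand(*([m] * n))          # feature tensor, one dimension per feature
v = F.reshape(-1)                       # flattened feature vector of length m**n
T = np.random.rand(*([v.size] * d))     # dense coefficient tensor: (m**n)**d entries

# Contract T with d copies of v over all indices (Einstein summation).
args = [x for i in range(d) for x in (v, [i])]
alpha = np.einsum(T, list(range(d)), *args)

print(v.size)   # 81 features
print(T.size)   # 531441 parameters already, for modest m, n, d
```

Even for these small sizes the parameter count reaches m^{dn} = 3^{12}; this is the growth that the low-rank formats below are meant to tame.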
One approach to reducing the number of free parameters is to convert the coefficient tensor in (1) into a low-rank tensor format that approximates multivariate polynomials, e.g., the tensor train (TT) format [20], [21]. Incorporating symmetries into tensor networks is often not easy. For example, in [22] a tensor train representation was used to approximate on-lattice atomic energies as functions of chemical degrees of freedom, and permutational symmetry was built in by explicitly symmetrizing the tensor over the corresponding point group of the lattice. In the present setting, we require that group actions of SO(3) on atomic positions leave (1) invariant; physically speaking, the (off-lattice) potential energy of a configuration of atoms must not change under pure rotations. A related, very recent development approaches this problem by utilizing a symmetric canonical tensor decomposition of the coefficient tensor [23].

Outline of the present work
In this work we propose equivariant tensor networks (ETNs), a formalism for constructing interatomic interaction models based on tensor networks that are invariant under the action of SO(3). In particular, we develop a tensor network algebra (Section 2) and use it to develop the following three building blocks that compose our ETNs (Section 3):
• covariant vectors v, i.e., vectors that rotate correspondingly with a basis change under actions of SO(3),
• equivariant order-3 tensors T, i.e., tensors that describe equivariant maps of the covariant vectors, and
• feature contractions, i.e., parameterized contractions that map the feature tensors F_{i_1...i_n} to some reduced covariant vector v.
Here and in what follows, the term "SO(3)-covariant vector" will be used in the context of a "feature vector" and refers to a collection, possibly large, of numbers that change (or remain constant) upon rotation, rather than specifically an element of R^3. We show that arbitrary ETNs expressing polynomials of arbitrary degree can be constructed entirely from these three building blocks. We present the formalism in the language conventional to the field of numerical linear algebra, although we emphasize that many key ideas already exist in the field of quantum physics, where they have been used to, e.g., incorporate parity symmetries into tensor networks representing the wave functions of bosons and fermions (e.g., [24]). In particular, the recent developments [25] concern SU(2)-equivariant tensor networks [26] representing quantum states of qubit systems, which currently receive a lot of attention in the context of quantum computing. In these tensor networks, SU(2) symmetries are incorporated using the Wigner-Eckart theorem, which states that any spherical tensor T with three multi-indices {(ℓ_i, m_{ℓ_i}, n_{ℓ_i})}_{i=1,...,3}, where ℓ is the angular momentum and m is the phase, can be factorized as

T_{(ℓ_1,m_{ℓ_1},n_{ℓ_1})(ℓ_2,m_{ℓ_2},n_{ℓ_2})(ℓ_3,m_{ℓ_3},n_{ℓ_3})} = P_{(ℓ_1,n_{ℓ_1})(ℓ_2,n_{ℓ_2})(ℓ_3,n_{ℓ_3})} Q_{(ℓ_1,m_{ℓ_1})(ℓ_2,m_{ℓ_2})(ℓ_3,m_{ℓ_3})},

where P contains the degrees of freedom and is independent of the phases m_{ℓ_i}, and Q is the Clebsch-Gordan coefficient, a constant tensor that is defined by the symmetry group. Key algorithms like those based on DMRG (density matrix renormalization group) have been formulated for such tensor networks [25].
In our work, we make full use of the structure of SO(3)-equivariance, namely, the possibility of building the irreducible representation on real-valued spherical harmonics. In the real-valued case, vector contraction and dot product are the same, which enables us to build them using the Wigner 3-j coefficients. The latter are symmetric with respect to permutations of (ℓ_i, m_{ℓ_i}), which results in undirected networks (they are directed in the SU(2) case due to the lack of symmetry of the Clebsch-Gordan coefficients, which results in network nodes of different types based on the direction of the adjacent edges). This allows us to formulate the usual operations of tensor network algebra, like reshaping an order-3 tensor core into an order-2 tensor, combining two cores together, or performing decompositions and splittings of the cores, while preserving the symmetric structure of the network. Furthermore, in our framework it is easy to also separate the reflection-symmetric/antisymmetric features, resulting in full O(3)-equivariance of the tensor network.
We exemplify the construction of ETNs by developing an equivariant version of the TT format. We will show that the number of parameters of this representation is proportional to (d + n) m r^2, where r is the average rank over all cores T. Hence, if there is enough similarity between the features, the ranks can be made small and the representation (3) will be more efficient than the "raw" polynomial (1). We also show how algorithms developed for conventional tensor networks, e.g., for optimizing tensor parameters, tensor orthogonalization, etc., carry over to the equivariant case (Section 4).
With our ETN formalism, we develop a new class of MLIPs that we henceforth refer to as ETN potentials (Section 5). In particular, we explicitly develop and implement an ETN potential that has atomic position and species features, and show that this ETN potential requires significantly fewer parameters than MTPs to reach the same level of accuracy on QM7, QM9, and several datasets for metallic alloys (Section 6).
Further, we outline how our formalism can be used to construct more general ETN potentials (Section 7). Examples include extensions to arbitrary ETN topologies, more general feature tensors, nonlocal atomic energies, and other symmetry groups.
The key advantage of our new formalism is its modularity due to the rather small set of required operations, in essence, tensor operations on up-to-order-3 tensors. This heavily simplifies generic implementations and the automation of operations other than contractions to be performed on the ETNs, such as automatic differentiation, factorization, etc. Our formalism for constructing ETN potentials can thus, in some sense, be considered an interatomic potential modular building system ("interatomic potential Lego") that allows for the rapid construction of efficient representations of polynomial MLIPs of arbitrary order and physical complexity.

A Short Introduction to Tensor Networks
Here we introduce the notions and notations needed to present equivariant tensor networks. We do not do so in maximal generality for the sake of conciseness of the presentation. Nevertheless, many features present in conventional tensor networks are also present in equivariant tensor networks; we remark on this in Section 7.
A tensor is a multidimensional array T^{⟨d⟩} ∈ R^{N_1 × N_2 × ... × N_d}, where d will be called the number of dimensions or simply the order of the tensor (we reserve the term rank for, e.g., the rank of a matrix or its generalization to tensors). Since we will mostly work with tensors up to order three, we use an underscore notation for those tensors, namely, v^{⟨1⟩} = v for an order-1 tensor (or vector), U^{⟨2⟩} = U for an order-2 tensor (or matrix), and T^{⟨3⟩} = T for an order-3 tensor. We will also use the index notation where, e.g., v, U, and T are written as v_i, U_{ij}, and T_{ijk}. For tensor contractions we will generally adopt the Einstein notation. In case we do not sum over all duplicate indices, we will specify the contracted indices explicitly.

Three is the magic tensor order
A vector u can be used to model a dependence u_i on one discrete variable i, or on one continuous variable upon choosing a (finite) basis v_i and expanding the one-dimensional function in this basis with coefficients u_i. The latter motivates us to consider a contraction α_u(v) = u • v corresponding to u and interpret it as a linear form over v. This way, a tensor of order d, T^{⟨d⟩}, can be thought of as describing a quantity depending on d variables and can be associated with a multilinear form

α_T(v_1, ..., v_d) = T_{i_1...i_d} v_1^{i_1} ⋯ v_d^{i_d}.    (4)

Of course, the simplest tensor is a scalar (zero-order tensor), but it is hardly sufficient to model anything of nontrivial complexity. A vector v can be used to model a dependence on one variable, while an outer product of two vectors is an order-two tensor u ⊗ v; however, it can only describe a two-dimensional function whose two variables are essentially independent of each other. Again, we conclude that a vector is too simple to describe complex dependencies. Of course, it is known that any dependence can be represented as a sum of u_k ⊗ v_k; however, this brings us to contractions of higher-order tensors.
Namely, the sum of u_k ⊗ v_k over k is more conveniently interpreted as a product of two matrices U V, which is a matrix (order-two tensor) itself. Here, by writing AB we mean that the last dimension(s) of A is contracted with the first dimension(s) of B. By taking contractions of matrices one can only obtain tensors of order two or less, which is, again, not sufficient to describe nontrivial multivariate dependencies.
Order-three tensors make a difference. With two order-three tensors A and B, by contracting one dimension of each one can obtain a nontrivial (in the sense defined above) order-four tensor A B; with three tensors one can obtain a nontrivial order-five tensor A B C, etc. Thus, three is the minimal tensor order that allows representing nontrivial dependencies on any number of variables; in this sense we say that three is the magic number for the tensor order.
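The growth of the tensor order under such contractions is easy to see numerically. The following minimal numpy sketch (with a hypothetical bond size r) contracts order-3 cores along one dimension and checks the order of the results:

```python
import numpy as np

# Sketch: contracting order-3 cores along one "bond" dimension produces
# tensors of ever higher order: two cores give order 4, three give order 5.
r = 2                                   # hypothetical dimension size
A = np.random.rand(r, r, r)
B = np.random.rand(r, r, r)
C = np.random.rand(r, r, r)

AB = np.einsum('ijk,klm->ijlm', A, B)            # order-4 tensor A B
ABC = np.einsum('ijk,klm,mno->ijlno', A, B, C)   # order-5 tensor A B C

print(AB.shape)   # (2, 2, 2, 2)
print(ABC.shape)  # (2, 2, 2, 2, 2)
```

Each additional order-3 core contracted at the end raises the order of the result by one, which is exactly why order three suffices to build tensors of any order.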

Tensor Diagrams
Tensorial operations on vectors and matrices can easily be represented with formulas (as a single string of products of multiple vectors and matrices); however, when the tensor order goes above two, it is much easier to represent the operations with diagrams. In Figure 1 we show the basic tensors and operations. The tensors are represented with boxes, their dimensions are represented with links, and the diagram represents the result of contracting the tensors with each other. If two tensor dimensions are connected, it means that they are contracted in the result. If there are no "hanging" links, then the result is a scalar.
Figure 1: (a) A vector, a matrix, and an order-3 tensor can be distinguished by the number of links (= dimensions) attached to them. (b) Tensorial contractions, from simple ones like the vector product (left) to more complex ones (right), can be illustrated by linking tensors, which corresponds to contracting over those dimensions. (c) The identity matrix (left) is illustrated as a simple link because, e.g., linking vectors through the scalar product directly (middle) or through multiplying by the identity matrix (right) gives the same result. (d) We also need the "identity order-3 tensor" (left), which can be used to express the pointwise product, e.g., as shown on the right.

As follows from the above, a multilinear form can be realized by contractions with multiple order-two and order-three tensors:

α(v_1, ..., v_d) = (v_1 T_1)(v_2 T_2) ⋯ (v_{d-1} T_{d-1})(v_d T_d),    (5)

where T_1 and T_d are order-2 tensors, T_2, ..., T_{d-1} are order-3 tensors, and each v_k T_k denotes the contraction of v_k with the feature dimension of T_k. Thus, the right-hand side of (5) is a product of two vectors and d − 2 matrices yielding a scalar. The multilinear form (5) is nothing but the tensor train (TT) representation [20] of the associated order-d tensor, which has been known in physics under the term matrix product state (MPS; see, e.g., [27], [28]). The graphical representation of (5) in terms of tensor network diagrams is shown in the accompanying figure. In these diagrams, a tensor is represented with a square block, with connections between blocks being contractions over the tensors' dimensions. Our notation of tensor network diagrams is inspired by the tensor network diagrams used in quantum physics (e.g., [29]) and appears to us very convenient for representing tensors (and tensor operations) in our setting, in particular in view of comparing different variants of equivariant tensor networks later on. To show the power of this notation, we remark that the order-d tensor T^{⟨d⟩} corresponding to (5) is not easy to write down in mathematical formulas, but very easy to represent as a visual diagram. From this diagram one can see that an order-three tensor (like T_2) is a fundamental building block which allows one to construct tensors of any order, which reiterates the point made in the previous subsection.
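The TT multilinear form can be sketched in a few lines of numpy (all sizes are hypothetical). Contracting each core with its feature vector leaves a row vector, d − 2 matrices, and a column vector whose product is the scalar α; the sketch cross-checks this against the explicitly assembled order-d tensor, which is feasible only for tiny sizes:

```python
import numpy as np

# Sketch of evaluating the TT multilinear form: a product of two vectors
# and d-2 matrices. N, r, d are hypothetical sizes.
N, r, d = 4, 3, 5
T1 = np.random.rand(N, r)                                # order-2 boundary core
Tmid = [np.random.rand(r, N, r) for _ in range(d - 2)]   # order-3 cores
Td = np.random.rand(r, N)                                # order-2 boundary core
vs = [np.random.rand(N) for _ in range(d)]

row = vs[0] @ T1                                 # row vector of size r
for T, v in zip(Tmid, vs[1:-1]):
    row = row @ np.einsum('anb,n->ab', T, v)     # one r x r matrix per core
alpha = row @ (Td @ vs[-1])                      # finish with the column vector

# Sanity check against the full order-d tensor (size N**d, tiny here).
full = T1
for T in Tmid:
    full = np.einsum('...a,anb->...nb', full, T)
full = np.einsum('...a,an->...n', full, Td)      # shape (N,) * d
args = [x for i, v in enumerate(vs) for x in (v, [i])]
alpha_full = np.einsum(full, list(range(d)), *args)
print(np.allclose(alpha, alpha_full))  # True
```

The sequential evaluation never materializes the N^d-sized tensor, which is the whole point of the TT format.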

SO(3)-Equivariant Tensor Networks
Our motivation is to use (4) to approximate O(3)-invariant functionals (e.g., interatomic interaction energies) of local atomic environments (or, more generally, clouds of points). O(3) invariance means that, if the atomic environments are rotated or reflected in three-dimensional space, then the functional α does not change. We separate the problem of O(3) invariance into SO(3) (rotation) invariance and reflection invariance. Mathematically, invariance of (4) with respect to a group G (for example, G = SO(3)) means that when v changes under a group action (in our example, a rotation), v → g • v, then α remains constant. Note that the model parameters, T^{⟨d⟩}, are not allowed to change with the group action (because the interatomic interaction model is independent of the frame of reference). Often, together with invariance comes the notion of equivariance. For instance, for (5) to be invariant, the matrix T does not only have to be SO(3)-invariant component-wise, but also SO(3)-equivariant, i.e., g^{-1} • T(g • v) = const as v rotates. We note that the equivariance of the operator T is equivalent to the invariance of the quadratic form associated with T, i.e., of (T(g • v)) • (g • v). We are hence interested in two kinds of objects: vectors (and sometimes tensors) rotating under the action of G, and tensors that remain constant but otherwise describe an equivariant transformation of the former vectors. The first kind of objects are typically the input features of a model (with an arbitrarily chosen frame of reference) or transformed features occurring in the middle of a calculation, while the second kind are model coefficients that "do not know" which frame of reference was chosen for the input. The first kind of objects are covariant vectors, i.e., vectors that change under rotations as v′ = D^T v, where D is a Wigner D-matrix, while the second kind of objects will be called equivariant tensors. In what follows, by "contraction" of tensors with anything else we mean invariant or equivariant contractions, depending on the context, unless explicitly mentioned otherwise.
Following the results of representation theory, for a covariant vector v, we consider its decomposition into an irreducible representation. In the case of SO(3), this corresponds to spherical harmonics components with different angular momenta. Hence, we consider v as a multi-index vector v_{(ℓmn)}, ℓ = 0, ..., L, where L is the maximal angular momentum, m ∈ {−ℓ, −ℓ + 1, ..., ℓ} is the phase, and n = 1, ..., N(ℓ) numbers the components corresponding to ℓ. We remark that we deviate intentionally from the notation v_{(nℓm)} commonly used in quantum physics, which puts the index n, usually referred to as the "principal quantum number", in front of ℓm (cf., e.g., [30]). Our reasoning for doing so is that we will later work with higher-dimensional tensors, where each multi-index may depend on a different ℓ. Since m and n are always assumed to depend on ℓ, using the order ℓmn appears more convenient to us for the sake of clarity. In the spirit of machine learning, we will call N(ℓ) the number of angular momentum channels corresponding to each ℓ. We assume that the action of SO(3) on v_{(ℓmn)} follows the action of SO(3) on the real spherical harmonic (RSH) function Ẏ_{ℓm}, defined in terms of the (complex) spherical harmonics Y_{ℓm} in (8). Using (8), we introduce the mapping U_{ℓmm′}: Y_{ℓm′} → Ẏ_{ℓm} [31], i.e.,

Ẏ_{ℓm} = Σ_{m′} U_{ℓmm′} Y_{ℓm′}.    (9)

Since U_ℓ is a unitary matrix, its inverse is its conjugate transpose, U_ℓ^{−1} = U_ℓ^{T*}. We then define the Wigner 3-j symbols (or just 3-j symbols) for RSHs in (10) by transforming each m-index of the usual 3-j symbol with U_ℓ^{T*}. The usual 3-j symbols are nonzero only when the three ℓ's and m's satisfy the usual two selection rules

[SR1] TR(ℓ_1, ℓ_2, ℓ_3), i.e., |ℓ_1 − ℓ_2| ≤ ℓ_3 ≤ ℓ_1 + ℓ_2,
[SR2] m_1 + m_2 + m_3 = 0,

where TR(ℓ_1, ℓ_2, ℓ_3) is the triangular inequality. The 3-j symbols for RSHs can also be nonzero for other linear combinations of the m's. More precisely, for the 3-j symbols for RSHs, the selection rule [SR2] is replaced by two selection rules [SR3] and [SR4]: the rule [SR3] constrains the signs sgn(m_i) of the phases, where sgn(•) is the sign function and the + or − holds for even or odd ℓ_1 + ℓ_2 + ℓ_3, respectively, and the rule [SR4] holds true if (at least) one of the (sub-)selection rules [SR4.1]-[SR4.4] holds true.
The selection rules [SR3] and [SR4] follow from the properties of U_ℓ^{T*} (see Appendix B.1). Although the additional selection rules suggest that the 3-j symbol for RSHs is much less sparse than the usual 3-j symbol, it is, in fact, almost as sparse as the usual 3-j symbol and does not sacrifice the computational efficiency (of tensor contractions) for convenience (undirected tensor networks), as shown in Appendix B.2.
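The sparsity induced by the usual selection rules [SR1] and [SR2] can be checked directly with sympy's `wigner_3j` (the RSH variant discussed above is, per the sparsity argument, comparably sparse; this sketch only counts nonzeros of the usual symbol for small ℓ):

```python
from sympy.physics.wigner import wigner_3j

# Sketch: count nonzero entries of the usual 3-j symbol for l <= L and verify
# that every nonzero entry obeys the triangle rule and m1 + m2 + m3 = 0.
L = 2
nonzero = total = 0
for l1 in range(L + 1):
    for l2 in range(L + 1):
        for l3 in range(L + 1):
            for m1 in range(-l1, l1 + 1):
                for m2 in range(-l2, l2 + 1):
                    for m3 in range(-l3, l3 + 1):
                        total += 1
                        if wigner_3j(l1, l2, l3, m1, m2, m3) != 0:
                            nonzero += 1
                            assert abs(l1 - l2) <= l3 <= l1 + l2   # [SR1]
                            assert m1 + m2 + m3 == 0               # [SR2]
print(nonzero, total)  # only a small fraction of entries is nonzero
```

This kind of exhaustive check is a convenient sanity test when implementing the RSH coupling tables, since a contraction routine only ever needs to visit the nonzero entries.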
We refer to Appendix B.3 for the equivalence of the 3-j symbol for RSHs and the Clebsch-Gordan coefficients in adding angular momenta. Moreover, we point out that the 3-j symbols for RSHs obey almost the same symmetries as the usual 3-j symbol (cf. Appendix B.4).
In addition, we remark that the 3-j symbol for RSHs (10) is real whenever ℓ_1 + ℓ_2 + ℓ_3 is even, and imaginary otherwise (cf. Appendix B.1 for further details). This feature of the 3-j symbol for RSHs will be important for the full O(3) (reflection + rotation) invariance.
With our definition of the 3-j symbol for RSHs, we now define the basic operations to be performed on covariant vectors that we require to build equivariant tensor networks.
Remark 1. At this point, we remark that the choice of the basis is not unique. One could likewise choose invariant polynomials (invariant with respect to rotation and reflection) as the basis set [9]. However, within the tensor network formalism we propose here, the operators that generate those invariant polynomials are not available off-the-shelf (as they are for spherical harmonics). It could nevertheless be interesting to explore other basis sets that may lead to more sparse representations.

Order-1 Tensors (Vectors)
The basic operations we require are the contraction of two covariant vectors and the equivariant contraction of a constant vector c with a covariant vector. The latter can only be performed with respect to the 0-th angular momenta, such that

α = Σ_n c_n v_{(0,0,n)}.    (16)

In other words, (16) is the most general equivariant contraction.
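A minimal sketch of these two order-1 operations, assuming a hypothetical storage layout in which a covariant vector is a dict mapping (ℓ, n) to the array of its 2ℓ + 1 components over m:

```python
import numpy as np

# Sketch of the two basic order-1 operations on covariant vectors, with an
# assumed dict layout {(l, n): array of 2l+1 components over m}.
def dot(u, v):
    """Invariant contraction of two covariant vectors with matching channels."""
    return sum(float(u[key] @ v[key]) for key in u)

def contract_const(c, v):
    """Equivariant contraction of a constant vector c with v: only the
    rotation-invariant l = 0 channels of v may be touched, cf. (16)."""
    return sum(c[n] * float(v[(0, n)][0]) for n in range(len(c)))

v = {(0, 0): np.array([1.0]), (0, 1): np.array([2.0]),
     (1, 0): np.array([0.1, 0.2, 0.3])}   # two l=0 channels, one l=1 channel
c = np.array([3.0, -1.0])
print(contract_const(c, v))  # 3*1 - 1*2 = 1.0
print(dot(v, v))             # sum over all (l, n) channels
```

Touching any ℓ > 0 channel with constant coefficients would break invariance, which is why `contract_const` ignores them by construction.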

Order-2 Tensors (Matrices)
Instead of defining the outer product of two covariant vectors simply as u_{(ℓ′m′n′)} v_{(ℓ″m″n″)}, we reinterpret it as a covariant vector

(u ⊗ v)_{(ℓ m (ℓ′ℓ″n′n″))} = Σ_{m′,m″} (ℓ′ ℓ″ ℓ; m′ m″ m)_{RSH} u_{(ℓ′m′n′)} v_{(ℓ″m″n″)},    (17)

where (ℓ′ ℓ″ ℓ; m′ m″ m)_{RSH} denotes the 3-j symbol for RSHs. Here, the index on the left-hand side should be understood as follows: for each n′, n″ and a pair of ℓ′ and ℓ″ satisfying TR(ℓ′, ℓ″, ℓ), we have one angular momentum channel n = (ℓ′ℓ″n′n″). Altogether, for each ℓ, the outer product has Σ_{TR(ℓ′,ℓ″,ℓ)} N′(ℓ′) N″(ℓ″) angular momentum channels. We chose the 3-j symbols rather than the Clebsch-Gordan coefficients to emphasize their symmetry with respect to permuting the tensorial dimensions.
We remark that we use the notation (17) instead of, e.g., the one used in quantum physics (cf. bipolar spherical harmonics [32]), implying that the product is nonzero only when TR(ℓ′, ℓ″, ℓ) holds, because we would like the number of channels to be explicitly featured in the notation.
An equivariant contraction of an arbitrary order-2 tensor C and two covariant vectors is then given by

α(u, v) = Σ_{ℓ,m,n,n′} C_{ℓnn′} u_{(ℓmn)} v_{(ℓmn′)}.    (19)

Note that we write C_{ℓnn′} with three indices, so it appears to be an order-3 tensor, but we use it in the following as a shorthand notation for a block-diagonal order-2 tensor C_{(ℓn)(ℓn′)}, with the blocks C_ℓ corresponding to different ℓ being arbitrary, regular N(ℓ) × N′(ℓ) matrices. Hence, many operations that can be done on matrices, such as the QR or SVD decomposition, can be done in a block-wise fashion, as shown in Section 4. The shortest (but not necessarily simplest) way to prove (19) is to consider the outer product of u and v (17) and then use the arbitrary vectorial contraction (16), noting that the scaling (−1)^ℓ/√(2ℓ+1) can be absorbed into C_{ℓnn′}.
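The block-diagonal structure of the bilinear form is easy to sketch numerically (hypothetical sizes; per-ℓ blocks stored in a dict, the m-index contracted as a plain dot product):

```python
import numpy as np

# Sketch of the bilinear form (19): one arbitrary N(l) x N(l) coefficient
# block C_l per angular momentum l, while the m-index is contracted
# identically on both vectors ("dot product" over m).
L = 2
N = {l: 3 for l in range(L + 1)}                                 # channels per l
u = {l: np.random.rand(2 * l + 1, N[l]) for l in range(L + 1)}   # u_{(l m n)}
v = {l: np.random.rand(2 * l + 1, N[l]) for l in range(L + 1)}
C = {l: np.random.rand(N[l], N[l]) for l in range(L + 1)}        # blocks C_l

alpha = sum(np.einsum('mn,nk,mk->', u[l], C[l], v[l]) for l in range(L + 1))
print(alpha)
```

Because the blocks C_ℓ never mix different ℓ (and never touch m), matrix factorizations of C can indeed be applied block by block, exactly as exploited in Section 4.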

Order-3 Tensors (where the Magic Happens)
In the previous subsection we have already encountered a special case of an order-3 equivariant tensor: the 3-j symbol for RSHs. If we consider an arbitrary order-3 tensor T given by a trilinear form

α(u, v, w) = T_{(ℓ_1m_1n_1)(ℓ_2m_2n_2)(ℓ_3m_3n_3)} u_{(ℓ_1m_1n_1)} v_{(ℓ_2m_2n_2)} w_{(ℓ_3m_3n_3)}

and require that α stays invariant as SO(3) acts simultaneously on u, v, and w, we arrive at the following general representation of T:

T_{(ℓ_1m_1n_1)(ℓ_2m_2n_2)(ℓ_3m_3n_3)} = C_{(ℓ_1n_1)(ℓ_2n_2)(ℓ_3n_3)} (ℓ_1 ℓ_2 ℓ_3; m_1 m_2 m_3)_{RSH},

where the coefficient tensor C can be arbitrary. Due to the properties of the 3-j symbols for RSHs, C can be defined as nonzero only when the triangular inequality (11) is satisfied.
The bilinear and, respectively, linear forms can be obtained from the above equations by simply substituting (ℓ_i, m_i, n_i) = (0, 0, 0) for one or, respectively, two of the indices, which expresses them directly in terms of T. In the following, we tacitly refer to C (and also T) as equivariant tensors without explicitly mentioning the dimension.
It is known that equivariant contractions of more than three covariant vectors can be expressed through products of the 3-j (or Clebsch-Gordan) coefficients, so we do not have to explicitly consider higher-order equivariant tensors. This leads to a very interesting coincidence: similarly to how order-3 tensors were sufficient to generate a tensor of any order by repeated contraction, a third-order equivariant tensor contains all the information needed to encode equivariance in tensors of any higher order. The algorithmic combination of these two facts is the core of this work; everything else can be thought of as a corollary of it.
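A familiar special case makes the invariance of such trilinear forms tangible: for three ℓ = 1 (Cartesian) vectors, the unique equivariant order-3 coupling is proportional to the Levi-Civita symbol, so the trilinear form is the scalar triple product u • (v × w). The sketch below checks its invariance under a random proper rotation (numpy only; the rotation is generated via QR, an assumption of convenience, not the paper's method):

```python
import numpy as np

# Sketch: for l = 1 Cartesian vectors, the equivariant order-3 coupling is the
# Levi-Civita symbol, and alpha(u, v, w) = u . (v x w) is SO(3)-invariant.
rng = np.random.default_rng(0)
u, v, w = rng.standard_normal((3, 3))

def alpha(u, v, w):
    return float(u @ np.cross(v, w))     # Levi-Civita contraction

# A random proper rotation: QR of a random matrix, determinant fixed to +1.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
R = Q * np.sign(np.linalg.det(Q))

print(np.isclose(alpha(u, v, w), alpha(R @ u, R @ v, R @ w)))  # True
```

Note that under a reflection this α changes sign, consistent with ℓ_1 + ℓ_2 + ℓ_3 = 3 being odd, which is exactly the parity bookkeeping used for O(3) invariance later on.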

Construction of Equivariant Tensor Networks
Using our definition of equivariant tensors, we now exemplify the construction of equivariant tensor networks (ETNs) using the tensor train format [20]. We emphasize that, in principle, any tensor network topology can be realized with equivariant tensors, including the already mentioned hierarchical Tucker decomposition [33], the tensor ring format [34], and more advanced networks like PEPS [35].

Equivariant Tensor Trains
In the tensor train format, a multilinear form is a product of equivariant tensors with covariant vectors, leading to the representation

α(v_1, ..., v_d) = (v_1 T_1)(v_2 T_2) ⋯ (v_d T_d),    (29)

with equivariant tensors T_i in place of the plain cores of (5). We denote this representation as the equivariant tensor train (ETT) representation of a contraction of a general (d-dimensional) equivariant tensor with d covariant input feature vectors. Within the ETT representation, every coefficient tensor C_i corresponding to an equivariant tensor T_i carries N_ett^rank(i, ℓ) angular momentum channels on its bond. As a boundary condition for α being a scalar, we require that N_ett^rank(0, 0) = N_ett^rank(d + 1, 0) = 1. The invariance of α with respect to SO(3) follows from the properties of the equivariant tensors discussed in the previous sections.

Invariance under O(3)
For our application to interatomic potentials, we also require reflection invariance of the multilinear forms α (29). To that end, we recall that, under the space inversion Pv_{(ℓmn)}(r) = v_{(ℓmn)}(−r), we have

Pv_{(ℓmn)} = (−1)^ℓ v_{(ℓmn)},

owing to the parity of the (real) spherical harmonics. To ensure the invariance of α with respect to O(3), we require, according to the previous section, that the sum of all ℓ's entering the contraction is even. Recalling the definition of the 3-j symbol for RSHs, this immediately implies that α is always a real quantity since, in this case, the sum over all ℓ's is obviously also even. This also implies that we can neglect imaginary parts for ETTs with three or fewer cores, since we only need to consider even ℓ_1 + ℓ′_2 + ℓ_2. For four or more cores, we need to keep track of both real and imaginary parts. Algorithms 1 and 2 are possible implementations of these two cases (therein, a tensor superscripted "re" denotes its real part, whereas a tensor superscripted "im" denotes its imaginary part).
Algorithm 1 (ETT contraction with three or fewer cores) and Algorithm 2 (ETT contraction with four or more cores) both output the invariant component u_0^{0,re}. There is another way to understand the real/imaginary splitting of the reflection-symmetric/antisymmetric components: for the conventional complex-valued harmonics written in the x, y, z coordinates (such that x^2 + y^2 + z^2 = 1), it holds that Y_{ℓm}(x, y, z) = (Y_{ℓm}(x, −y, z))*, and tracking this identity through our definitions of the RSHs and the 3-j symbols yields an alternative proof.

Equivariant Tensor Network Numerical Algebra
The power of conventional tensor networks lies in the rich family of algorithms that operate on them. In this section we show that all the key algorithms for manipulating conventional tensor networks can be adapted to work with equivariant tensor networks as well.

Trivial algorithms: gradient descent and alternating least squares
Although tensor networks have a nontrivial structure, their evaluation reduces to sequential multiplications and additions of numbers, which can be effectively differentiated with respect to the tensor core parameters using backpropagation; hence, the gradient descent algorithm for optimizing tensor parameters (weights) can trivially (from the mathematical point of view) be formulated. A nontrivial generalization would be Riemannian gradient descent (i.e., descent with the Riemannian gradient), but we leave this to future work.
Another trivially transferable algorithm is alternating least squares (ALS): the tensor depends linearly on each of the cores C_{(ℓ_1n_1)(ℓ_2n_2)(ℓ_3n_3)}, and they can be optimized one at a time, each time solving a simple quadratic optimization problem.
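The linearity in each core is what makes the ALS step a plain least-squares problem. The following sketch performs one such step on an ordinary (non-equivariant) TT for clarity; all sizes, the data X, y, and the environment bookkeeping are hypothetical:

```python
import numpy as np

# Sketch of one ALS step on a plain TT: alpha is linear in each core, so
# updating core k with all others fixed is a linear least-squares problem.
rng = np.random.default_rng(1)
N, r, d, S = 3, 2, 3, 50                   # feature size, rank, order, samples
cores = [rng.standard_normal((1, N, r)),
         rng.standard_normal((r, N, r)),
         rng.standard_normal((r, N, 1))]
X = rng.standard_normal((S, d, N))         # S samples of d feature vectors
y = rng.standard_normal(S)                 # hypothetical training targets

def predict(cores, X):
    out = np.ones((X.shape[0], 1))
    for k, T in enumerate(cores):
        out = np.einsum('sa,anb,sn->sb', out, T, X[:, k])
    return out[:, 0]

res_before = np.linalg.norm(predict(cores, X) - y)

k = 1                                      # core to update
left = np.ones((S, 1))                     # environment left of core k
for j in range(k):
    left = np.einsum('sa,anb,sn->sb', left, cores[j], X[:, j])
right = np.ones((S, 1))                    # environment right of core k
for j in range(d - 1, k, -1):
    right = np.einsum('anb,sn,sb->sa', cores[j], X[:, j], right)

# Design matrix: alpha is A @ vec(core_k) for each sample.
A = np.einsum('sa,sn,sb->sanb', left, X[:, k], right).reshape(S, -1)
sol, *_ = np.linalg.lstsq(A, y, rcond=None)
cores[k] = sol.reshape(cores[k].shape)

res_after = np.linalg.norm(predict(cores, X) - y)
print(res_before, res_after)  # the residual can only decrease
```

In the equivariant case the same sweep applies, with the coefficient blocks C_ℓ of one core playing the role of the unknowns.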
We next come to more advanced algorithms, generalization of which to the equivariant case is less trivial.

QR or SVD normalization of cores
In the course of optimization, it is beneficial to orthogonalize the cores to avoid, e.g., unnecessary growth of the weights during the optimization.

Order-2 cores
To that end, we consider two order-2 cores (cf. Figure 2),

A_{(ℓmn)(ℓ′m′n′)} = Â_{(ℓn)(ℓn′)} δ_{ℓℓ′} δ_{mm′},  B_{(ℓmn)(ℓ′m′n′)} = B̂_{(ℓn)(ℓn′)} δ_{ℓℓ′} δ_{mm′},

where δ is the Kronecker delta symbol. Notice that we have explicitly kept all three indices (ℓmn) for each tensor dimension and used the Kronecker delta to indicate which indices are effectively equal. Assuming these tensors are adjacent in the tensor network, they form another order-2 product C = A B. What we see inside the corresponding sum is, effectively, a product of two block-diagonal matrices in which different values of ℓ comprise independent blocks. We therefore have Ĉ_ℓ = Â_ℓ B̂_ℓ for each ℓ. Orthogonalization is thus simple: we can, e.g., compute a QR decomposition Â_ℓ = Q_ℓ R_ℓ for each ℓ and set Â′_ℓ := Q_ℓ and B̂′_ℓ := R_ℓ B̂_ℓ. The new A′ and B′ make up the same product C as the old ones, but now the columns of Â′_ℓ are orthonormal. Likewise, for each ℓ we can apply the SVD (singular value decomposition) Â_ℓ B̂_ℓ = U_ℓ D_ℓ V_ℓ^T and set Â′_ℓ := U_ℓ D_ℓ^{1/2} and B̂′_ℓ := D_ℓ^{1/2} V_ℓ^T. In this case both the columns of Â′_ℓ and the rows of B̂′_ℓ are mutually orthogonal.
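The blockwise orthogonalization is a few lines of numpy (hypothetical block sizes; the dict keys stand for the different values of ℓ):

```python
import numpy as np

# Sketch: orthogonalizing two adjacent order-2 equivariant cores blockwise.
# Each l-block is an ordinary matrix, so QR applies per block; the product
# A_l B_l is unchanged and the new A'_l gains orthonormal columns.
rng = np.random.default_rng(2)
L, N = 2, 4
A = {l: rng.standard_normal((N, N)) for l in range(L + 1)}
B = {l: rng.standard_normal((N, N)) for l in range(L + 1)}

A_new, B_new = {}, {}
for l in range(L + 1):
    Q, R = np.linalg.qr(A[l])
    A_new[l], B_new[l] = Q, R @ B[l]

for l in range(L + 1):
    assert np.allclose(A[l] @ B[l], A_new[l] @ B_new[l])   # product preserved
    assert np.allclose(A_new[l].T @ A_new[l], np.eye(N))   # orthonormal columns
print("ok")
```

The SVD variant is identical in structure, with `np.linalg.svd` per block and the singular values split between the two new cores.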

Order-3 cores
We now consider two adjacent order-3 cores. This is a harder case and requires some preparation.
Consider an order-3 tensor and the corresponding trilinear form We later contract this tensor along the third dimension while the first two will remain uncontracted.This motivates us to reexpand u (ℓ (a) The order-2 tensor is split into a coefficient matrix indexed by and a Kronecker delta in (b) The -indexed matrix is decomposed into diagonal blocks each indexed by SVD (c) The SVD (or likewise other decompositions) can be applied blockwise, effectively preserving the equivariant structure.We fix ℓ 1 and ℓ 2 , and let ℓ go from |ℓ 1 − ℓ 2 | to ℓ 1 + ℓ 2 .We then introduce an auxiliary ℓ 1 , ℓ 2 -indexed family of matricies The first dimension of each of these matrices is ℓm and the second one is m 1 m 2 .The size of each matrix is exactly (2ℓ 1 + 1)(2ℓ 2 + 1) × (2ℓ 1 + 1)(2ℓ 2 + 1).We denote its inverse as (J ℓ 1 ,ℓ 2 ) −1 (it exists since the 3-j coefficients form a one-to-one mapping) and hence express . Our trilinear form α, after substitution of U instead of u and v becomes a bilinear form The last expression should remind us of the order-2 tensor that we have considered in the last subsection.In other words, if we denote it is the coefficient matrix of the order-2 tensor.This motivates us to consider the following definition of reshaping an order-3 tensor to an order-2 tensor: where the index ((1, 2), 3) indicates that we are merging dimension 1 and 2 together.The rest is relatively straightforward.If the tensor Ã(ℓ 1 m 1 n 1 )(ℓ 2 m 2 n 2 )(ℓ 3 m 3 n 3 ) is contracted with another tensor B(ℓ 3 m 3 n 3 )(ℓ 4 m 4 n 4 )(ℓ 5 m 5 n 5 ) , we reshape the latter's coefficients to B(ℓ 3 n 3 ) (ℓ 3 (ℓ 4 ℓ 5 n 4 n 5 )) and proceed as in Section 4.2.1 for order-2 tensors.That is, for the QR decomposition we let, independently for each ℓ 3 , Â = QR, and assign Â′ := Q and B′ := R B. 
Likewise, for SVD we decompose Â B̂ = U D V^T and assign Â′ := U D^{1/2} and B̂′ := D^{1/2} V^T. After setting the new cores Â′ and B̂′ we trivially reshape them back to the sought order-3 cores A′ and B′.
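The SVD-based split Â′ := U D^{1/2}, B̂′ := D^{1/2} V^T can be checked numerically. The sketch below uses hypothetical block sizes for a single fixed ℓ and verifies that the product is preserved while both new factors become orthogonal up to the diagonal D.

```python
import numpy as np

rng = np.random.default_rng(1)
# Reshaped coefficient blocks of two adjacent cores for one fixed l
# (hypothetical sizes).
A_hat = rng.standard_normal((8, 5))
B_hat = rng.standard_normal((5, 7))

U, d, Vt = np.linalg.svd(A_hat @ B_hat, full_matrices=False)
A_new = U * np.sqrt(d)            # A' = U D^{1/2}
B_new = np.sqrt(d)[:, None] * Vt  # B' = D^{1/2} V^T

# The product is unchanged ...
assert np.allclose(A_new @ B_new, A_hat @ B_hat)
# ... and both factors are orthogonal up to the diagonal D.
assert np.allclose(A_new.T @ A_new, np.diag(d))
assert np.allclose(B_new @ B_new.T, np.diag(d))
```

In the equivariant setting this is simply repeated for every ℓ3-block, as in the order-2 case.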
The ideas presented above are illustrated in Figure 3. The case of contracting an order-3 tensor with an order-2 tensor is, in view of the above, straightforward: we reshape the order-3 tensor and treat it as an order-2 tensor. Remark 2. Assuming a QR decomposition of Â B̂ such that Â = Q, an interesting property that follows directly from the orthogonality of Â is the unitarity (more precisely, left-unitarity) of the associated equivariant tensor A; to see this, we absorb the prefactor (2ℓ + 1) from (B23) into Â. Hence, there appears to us to be no obstacle to extending convergence results for the usual TT-ALS algorithm to the equivariant case; the unitarity property (32) hints that an ETT-ALS algorithm stabilized using QR decompositions should be guaranteed to converge (cf. Theorem 1 in [36]).
Remark 3 (a potentially more efficient version of SVD-based orthogonalization). The dimension ℓ3(ℓ1ℓ2n1n2) of the reshaped matrix may be rather large. Instead of applying the SVD to the product Â B̂, which can be rather large, one can apply an SVD separately; the expression in parentheses is a square matrix with the reduced dimension (ℓ3n3), which can itself be decomposed.

Tensor optimization algorithms
Reshaping equivariant tensors opens the way to applying a number of algorithms that are standard for conventional tensor networks. For instance, algorithms based on the Density Matrix Renormalization Group (DMRG) [28], [37], like TT-DMRG-cross [38], which is applied to the problem of minimizing a quadratic functional of a multidimensional tensor, can now be easily formulated. Indeed, if we have two adjacent cores A and B (as in Section 4.2.2), we can reshape and combine these cores into a large matrix Â B̂, with respect to which, with all other cores fixed, the optimization problem is quadratic. We can hence solve this quadratic optimization problem, apply the SVD to decompose the result, and reshape back; this forms the new cores A′ and B′.
Reshaping two adjacent order-3 tensors into order-2 tensors opens the possibility of transferring standard tensor-network algorithms onto the equivariant tensor networks. For example, consider a tensor-train representation of T⟨d⟩ (compared to (29), our (ℓmn)-dependent cores carry tildes to be consistent with the rest of the notation in this section). The optimization problem F(T) → min with all cores except T^(i−1) and T^(i) fixed reduces to a quadratic problem: the coefficients of these two tensors are reshaped to matrices, and F is then quadratic in terms of their product C. The problem is hence solved, C is decomposed into U D V^T, and the updated cores are set from the factors. The idea of how the TT-DMRG-cross algorithm can be applied to two adjacent tensor-network cores is diagrammatically illustrated in Figure 4.
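The two-site update described above can be sketched without any equivariant machinery: with all other cores fixed, the model output is linear in the merged block C, so fitting C is a linear least-squares problem, and the block is split back into two cores by a truncated SVD. All sizes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# With all other cores fixed, the output is linear in C = T_i T_{i+1}.
n_samples, r_left, r_right, rank = 50, 6, 7, 3
X = rng.standard_normal((n_samples, r_left * r_right))  # fixed environment
y = rng.standard_normal(n_samples)

# Solve the quadratic (least-squares) problem for the merged block.
C, *_ = np.linalg.lstsq(X, y, rcond=None)
C = C.reshape(r_left, r_right)

# Split the optimized block back into two cores via a truncated SVD.
U, d, Vt = np.linalg.svd(C, full_matrices=False)
T_i = U[:, :rank] * np.sqrt(d[:rank])
T_ip1 = np.sqrt(d[:rank])[:, None] * Vt[:rank]

approx = T_i @ T_ip1
# Truncation error is controlled by the discarded singular values
# (Eckart-Young theorem).
assert np.linalg.norm(C - approx) <= np.sqrt(np.sum(d[rank:] ** 2)) + 1e-10
```

In the equivariant case the same steps are applied per ℓ-block of the reshaped cores.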

Application to Interatomic Potentials
We now use ETNs to construct interatomic potentials. We present our ETN potentials in the classical way, using formulas defining the features, the functional form, etc., as well as in the tensor network diagram notation that we have developed in the previous sections. This not only makes the proposed functional forms easier to understand, but also allows comparing those of different potentials. In particular, we present tensor network diagrams for several state-of-the-art MLIPs in Appendix C, including MTP, SNAP, and ACE.

Equivariant Tensor Train Representation of Per-Atom Energies
Before applying our ETN formalism, we first fix some additional notation: in the following, we denote a configuration of atoms by {r_i}, and the neighborhood of an atom i by {r_ij}. We then define the input feature vectors more explicitly as functions of atomic neighborhoods, where r̂_ij is the angular contribution to r_ij and f_n is a vector of pair-wise non-angular features that we will define momentarily. We can then write the per-atom energy, and the total energy of a configuration {r_i} follows by summation over all atoms. Remark 4. In fact, we use the vectors v′ = (1, v^T)^T as input vectors to the ETT in order to generate complete polynomials. We omit this detail as it would require some tedious notation that is not essential to explaining the basic construction of the potentials.
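To make the structure of the per-atom energy concrete, the sketch below evaluates a plain (non-equivariant) weight-shared tensor train on a single feature vector v: the energy is a degree-d polynomial in the features. All names and sizes are hypothetical toy choices, not the paper's actual cores.

```python
import numpy as np

rng = np.random.default_rng(3)
n_feat, rank, d = 5, 4, 3   # hypothetical sizes

T = rng.standard_normal((rank, n_feat, rank)) * 0.1  # shared TT core
left = rng.standard_normal(rank)                     # boundary vectors
right = rng.standard_normal(rank)

def per_atom_energy(v, T, left, right):
    """E_i = left^T (sum_k v_k T[:, k, :])^d right: a homogeneous
    degree-d polynomial in the features v."""
    M = np.einsum("k,akb->ab", v, T)  # feature-contracted transfer matrix
    x = left
    for _ in range(d):
        x = x @ M
    return x @ right

v = rng.standard_normal(n_feat)
E = per_atom_energy(v, T, left, right)
# Scaling the features by t scales the energy by t^d.
assert np.isclose(per_atom_energy(2.0 * v, T, left, right), 2.0**d * E)
```

Remark 4 above explains why the actual construction prepends a constant 1 to v: it turns this homogeneous polynomial into a complete one.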

Non-Angular Features
In the following, we consider multicomponent systems with N_spec atomic species. That is, the feature vectors depend not only on the atomic positions in the neighborhood of an atom i but also on the species z_i of the i-th atom and the species z_j of all atoms j in its neighborhood. We assume a lexicographical order of the atomic species s ∈ {0, 1, 2, …, N_spec − 1}, where s_i and s_j are the species of the i-th and j-th atoms, respectively. We may then write the non-angular feature vectors in terms of radial basis functions Q_μ(∥r_ij∥). A trilinear form of type (24) with such v's then follows.
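A minimal sketch of such non-angular features: the pairwise feature tensor is an outer product of a radial basis evaluated at the bond length and one-hot species vectors for the central and neighboring atom. The radial basis chosen here (Chebyshev polynomials times a smooth cutoff) is a hypothetical stand-in for the paper's Q_μ.

```python
import numpy as np

def radial_basis(r, n_rad, r_cut=5.0):
    """Hypothetical choice: Chebyshev polynomials times a smooth cutoff."""
    x = np.clip(2.0 * r / r_cut - 1.0, -1.0, 1.0)
    env = (1.0 - r / r_cut) ** 2 if r < r_cut else 0.0
    return env * np.cos(np.arange(n_rad) * np.arccos(x))

def nonangular_features(r_ij, s_i, s_j, n_rad, n_spec):
    """v_{mu beta gamma} = Q_mu(|r_ij|) * onehot(s_i)_beta * onehot(s_j)_gamma."""
    Q = radial_basis(np.linalg.norm(r_ij), n_rad)
    z_i = np.eye(n_spec)[s_i]
    z_j = np.eye(n_spec)[s_j]
    return np.einsum("m,b,g->mbg", Q, z_i, z_j)

v = nonangular_features(np.array([1.0, 2.0, 0.5]), s_i=0, s_j=2,
                        n_rad=4, n_spec=3)
assert v.shape == (4, 3, 3)
# Only the (s_i, s_j) species slice is nonzero.
assert np.count_nonzero(v[:, 1, :]) == 0 and np.count_nonzero(v[:, :, 1]) == 0
```

The sparsity visible in the assertions is exactly what the contraction in the next subsection exploits.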

Equivariant Tensor Network Potentials
One problem that arises when defining a potential energy as in the previous sections is the growth of the parameter space when the number of features becomes large. Even considering only radial and species features can already lead to a very large parameter space, e.g., for systems with many components such as high-entropy alloys. To address this problem, we propose in the following a contraction of the features before they enter the ETT, leading to equivariant tensor network (ETN) potentials.
To illustrate this idea, we consider a contraction of only the non-angular features and leave the spherical harmonic basis untouched. To that end, we reshape the input feature vectors v_(ℓmμβγ) into three-dimensional covariant tensors. We then contract μ and (βγ) before entering the ETT, that is, we define the contracted feature vectors with ℓ-dependent coefficient tensors. We remark here that we never explicitly compute A^λ_(βγ) z_iβ z_jγ but rather take the (s_i, s_j)-th column of the matrix A, since this is the only nonzero entry in z_i ⊗ z_j according to (36). Algorithm 3 summarizes our implementation of computing v.
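The column-lookup remark can be verified directly: contracting the species-contraction tensor A with one-hot vectors z_i and z_j is identical to indexing A at (s_i, s_j). Sizes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n_spec, n_rank = 4, 3
A = rng.standard_normal((n_rank, n_spec, n_spec))  # species-contraction tensor

s_i, s_j = 1, 3
z_i = np.eye(n_spec)[s_i]
z_j = np.eye(n_spec)[s_j]

# Full contraction over both species dimensions ...
full = np.einsum("lbg,b,g->l", A, z_i, z_j)
# ... equals a simple lookup, since z_i (x) z_j has a single nonzero entry.
lookup = A[:, s_i, s_j]
assert np.allclose(full, lookup)
```

This replaces an O(N_spec²) contraction per bond with an O(1) indexing operation.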
Algorithm 3: Computation of the contracted feature vectors v for the ETN potential (43).

In the tensor network diagram notation, the per-atom energy then reads as in (43). This is, of course, not the only possibility for constructing a tensor network potential. For example, if the number of atomic species becomes large, one may envision a separation of z_i and z_j, leading to (44). We call the T^(i) tensors in (43) and (44) ETT cores and the entire lower part the ETT.
We emphasize that we have used weight sharing between ETT cores for the two ETN potentials defined above, i.e., the parameter tensors A, B, and C do not depend on d, as opposed to T. We could likewise have used a core-dependent feature contraction, which could simplify the training since the ETNs would then depend linearly on each parameter tensor (cf. our remarks in Section 7).

Hyperparameters of Tensor Network Potentials
MLIPs contain a number of hyperparameters, such as the cut-off radius, the choice of the radial basis, etc. In the following, we list the hyperparameters that are specific to ETN potentials and do not occur in other MLIPs. The ETN-specific hyperparameters are related to the ranks of the tensors A, B, and T, as shown in Table 1. All tensorial dimensions are defined per core i and angular momentum ℓ. We remark that the ETT dimension d and the total numbers of angular momenta for the feature vectors and the ETT ranks are also hyperparameters; they define the body-order and the polynomial degree of the ETN potential.
Table 1: Tensorial dimensions for the ETN potential (43)

Performance of Equivariant Tensor Network Potentials

Optimization of the Hyperparameters
In order to identify the "best" hyperparameters for the ETN potentials, i.e., hyperparameters close to the Pareto front that minimizes some objective with respect to the number of fitting parameters, we propose a stochastic gradient descent algorithm. The objective could be, e.g., the validation loss, the root-mean-square error of per-atom energies, or some other quantity of interest. Since the total number of hyperparameters becomes large when the ETT dimension and the total number of angular momenta grow, we first place some assumptions, based on an "educated guess", on the selection of the hyperparameters: • The total number of angular momenta of the ETT ranks is constant over all cores and, moreover, equal to the total number of angular momenta L of the feature vectors.
• The numbers of channels per angular momentum of the ETT ranks are constant over all cores, i.e., N_ett^rank(i, ℓ) = N_ett^rank(ℓ).
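The stochastic search over these (integer-valued) hyperparameters can be sketched as a simple greedy descent over random single-knob perturbations. The objective below is a cheap synthetic stand-in; in practice each evaluation would train an ETN potential and return its validation loss. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)

def validation_loss(h):
    """Synthetic stand-in for the (expensive) validation loss of a
    trained ETN potential with hyperparameters h."""
    return ((h["n_rad"] - 8) ** 2 + (h["L"] - 2) ** 2
            + (h["rank"] - 4) ** 2 + 0.01 * rng.standard_normal())

def random_neighbor(h):
    """Perturb one randomly chosen integer hyperparameter by +-1."""
    h = dict(h)
    key = rng.choice(list(h))
    h[key] = max(1, h[key] + int(rng.choice([-1, 1])))
    return h

h = {"n_rad": 2, "L": 1, "rank": 1}
best = validation_loss(h)
for _ in range(200):
    cand = random_neighbor(h)
    loss = validation_loss(cand)
    if loss < best:          # greedy stochastic descent over integer knobs
        h, best = cand, loss

assert best < 1.0
```

The assumptions listed above shrink the search space so that such a descent remains affordable even though each loss evaluation requires a full training run.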

Numerical Examples
We test the algorithm on various datasets for multicomponent systems from the literature. For the optimization of the loss function, we use Python's plain BFGS solver. During optimization, we monitor the validation loss. After 1000 iterations, we stop the optimization and select the minimal validation loss over all iterations. We repeat the optimization three times to compute the mean validation loss. Further, for all computational results presented in the following, we use a cut-off radius of 5 Å, if not specified otherwise. We compare the performance of the ETN potentials to moment tensor potentials (MTPs) in terms of the required number of parameters; we refer to Appendix C.1 for the functional form of MTPs. We will analyze the validation loss and the root-mean-square errors of the per-atom energies and forces, referred to as RMSE and RMSF in the following.
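The training protocol (BFGS with a validation-loss monitor and selection of the best iterate) can be sketched with SciPy's BFGS implementation. The toy model below is a linear fit; the dataset, sizes, and tolerance are all hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)

# Toy "potential" whose energies are linear in the parameters.
X_train, X_val = rng.standard_normal((80, 6)), rng.standard_normal((20, 6))
w_true = rng.standard_normal(6)
y_train, y_val = X_train @ w_true, X_val @ w_true

def train_loss(w):
    return np.mean((X_train @ w - y_train) ** 2)

history = []  # validation loss recorded at every BFGS iteration
res = minimize(
    train_loss,
    x0=np.zeros(6),
    method="BFGS",
    callback=lambda w: history.append(np.mean((X_val @ w - y_val) ** 2)),
    options={"maxiter": 1000},
)

# Select the minimal validation loss over all iterations.
best_val = min(history)
assert best_val < 1e-4
```

For the actual potentials the training loss is of course non-quadratic in the parameters, which is why premature convergence of BFGS becomes an issue (see the QM9 discussion below).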

QM7
The QM7 dataset [39], [40] contains 7211 small molecules with up to 23 atoms and six atomic species (C, N, O, S, Cl, and H) in their equilibrium positions. We select 6186 configurations for the training set and 1025 configurations for the validation set. We fit to total energies.
The decay of the mean validation loss, computed using the stochastic gradient descent algorithm, is shown in Figure 5 (a). In general, the ETN potentials require much fewer parameters than MTPs for a comparable level of accuracy, especially the ETN potentials with lower complexity.

QM9
The QM9 dataset [41], [42] contains 134k small organic molecules with up to five atomic species (C, H, O, N, F) in their equilibrium positions. We select a random subset of 20k configurations from the full dataset; of these, we use 17k configurations for the training set and 3k for the validation set. We fit to total energies.
As for QM7, ETN potentials with lower complexity require much fewer parameters than MTPs. However, when approaching the realm of higher accuracy (RMSE < 10 meV/atom, cf. Figure 6 (b)), there is no longer a difference between ETN potentials and MTPs. Presently, we attribute this to Python's BFGS solver not being the optimal choice for training ETN potentials. Indeed, we have observed premature convergence (after ∼50 iterations) for approximately 50% of the training runs per iteration once the ETN potentials reach a higher complexity (i.e., from ∼400-500 coefficients). On the other hand, Python's BFGS is also not optimal for MTPs, yet MTPs already have a more mature functional form that appears to perform better on QM9.

MoNbTaW Medium-Entropy Alloy
This dataset has been developed to construct a spectral neighbor analysis potential (SNAP) for the MoNbTaW medium-entropy alloy [43]; hence, we refer to it as MoNbTaW-MEA in the following. The dataset contains 5529 configurations of various types, e.g., configurations with free surfaces, snapshots from molecular dynamics simulations at finite temperature, etc. We use 4983 configurations for the training set and 546 for the validation set (the validation set is constructed such that every configuration type is represented). We fit to total energies and forces.
Over the entire range of potentials, the number of coefficients of the ETN potentials is universally lower than that of MTPs of similar accuracy, as shown in Figure 7 (a) and (b). Even when approaching the realm of high accuracy (RMSE ∼ 5-6 meV/atom, RMSF ∼ 150-180 meV/Å), ETN potentials require 2-3 times fewer parameters than MTPs. We attribute this excellent performance to the ability of ETN potentials to learn similarities between atomic species by varying the species rank N_spec^rank, since all atoms belong to the group of transition metals.

MoNbTaVW High-Entropy Alloy
This dataset, henceforth referred to as MoNbTaVW-HEA, appears to us to be the first dataset generated for a high-entropy alloy (five components or more). It is more general than MoNbTaW-MEA in the sense that it contains a more diverse set of configurations, including various defects (vacancies, di- and tri-vacancies, self-interstitial atoms, stacking faults) and liquid states [44]. The whole dataset contains 2859 configurations, among them many out-of-equilibrium configurations with very large forces (up to ∼150 eV/Å), such as dimers. We have removed configurations containing forces > 5 eV/Å, leading to a subset of 1423 configurations. From this subset we use 1210 configurations for training and 213 for validation. We fit to total energies and forces.
From Figure 8, we observe that the performance of ETN potentials is similar to that on MoNbTaW-MEA. In comparison to MTPs, ETN potentials require many times fewer coefficients to reach a comparable mean validation loss. We remark, though, that for MoNbTaVW-HEA it now appears more difficult to fit MTPs, as the fluctuation in the loss function is much larger than for ETN potentials. Nevertheless, even when comparing the mean validation loss of ETN potentials with the minimum validation loss of MTPs, ETN potentials still require 2-3 times fewer coefficients when both potentials give comparable accuracy.

Binary Alloys with 10 different Species
We now test the performance of ETN potentials on a dataset containing 10 binary alloys [45], with 10 different atomic species in total (AgCu, AlFe, AlMg, AlNi, AlTi, CoNi, CuFe, CuNi, FeV, and NbNi); we refer to it as BA10 in the following. BA10 contains all possible fcc, bcc, and hcp structures with up to eight atoms in the unit cell, 15950 configurations in total. We select 13558 configurations as the training set; the remaining 2392 configurations serve as the validation set. Since BA10 only contains equilibrium structures, we only fit to total energies. Further, we choose a cut-off radius of 7 Å.
As for the previous alloy datasets MoNbTaW-MEA and MoNbTaVW-HEA, ETN potentials generally require fewer coefficients to reach the accuracy of MTPs, as shown in Figure 9, although the difference is less pronounced (a factor of ∼1.5-1.8, as compared to > 2 for MoNbTaW-MEA and MoNbTaVW-HEA). Since BA10 only contains equilibrium structures, it also appears not too difficult to fit an MTP to it. We remark, though, that for an ETN potential with 10 atomic species, the structure of the ETN (43) may not be optimal, since the size of the tensor A^ℓ is already 100 × N_spec^rank(ℓ). For example, for the ETN potential with L = 5 and N_spec^rank = 3 (cf. Figure 10), the total size of all A^ℓ's is 800. We think that an ETN of type (44), which also splits the species features, could substantially reduce this number. We leave this for exploration in future work.

Discussion
Evolution of the hyperparameters. We now analyze how the hyperparameters evolve during the optimization procedure. For each dataset, the ETN hyperparameters H are shown in Figure 10 as a function of the number of coefficients/iterations, corresponding to Figures 5-9.
The number of radial basis functions N_rad grows approximately similarly for all datasets, up to 8-11, except for QM9, where N_rad grows up to 15.
For the ETT order d and the maximum angular momentum L, we observe an interesting difference between the molecule datasets and the alloy datasets: while d dominates over L for the molecule datasets throughout the optimization, the opposite holds for the alloy datasets. This indicates that the body-order is more important for molecules than for alloys, and, vice versa, the polynomial degree is more important for alloys than for molecules. In addition, we observe that the order does not increase beyond 3 for the alloy datasets for interatomic potentials, i.e., for MoNbTaW-MEA and MoNbTaVW-HEA. Order-3 ETN potentials are more sparse since we can neglect the imaginary parts of the equivariant tensors (cf. Section 3.4.2). This implies that we can construct highly efficient ETN potentials for many-component alloys.
For the ranks, we observe that the species rank N_spec^rank increases up to 5 and 4 for the two molecule datasets, respectively, and up to 3 for the alloy datasets. This is not surprising, since QM7 and QM9 contain atoms from different groups (nonmetals and halogens), while the alloy datasets only contain atoms from the group of transition metals; the exception is BA10, but BA10 does not contain all possible pair-wise combinations of atomic species. Moreover, we observe that the radial rank N_rad^rank always settles at a value ∼2-3 times lower than N_rad. This indicates that our assumption of letting the ranks decay with increasing angular momentum ℓ is a reasonable choice. The ETT rank N_ett^rank is generally lower than N_rad^rank, which is expected since N_ett^rank > N_rad^rank may "overparameterize" the potential. The observed N_rad^rank > N_ett^rank thus further confirms that our setup for constructing ETN potentials is indeed suitable and validates the correct functioning of the optimization algorithm.
Influences of commonalities and differences across the training sets. In the future, it would be interesting to better understand the origin of the observed compression rates. While this would require much more extensive testing with more data, some conclusions can already be drawn. Transition metals have very similar physical and material properties. Since the developers of MoNbTaW-MEA and MoNbTaVW-HEA were mainly concerned with those properties, it is not surprising that adding V to MoNbTaW did not alter the species ranks. This picture might change, however, when considering chemically reactive properties with oxygen, e.g., for catalysts. On the other hand, QM7 and QM9 consider a wider class of molecular structures with different bonding properties, and it is thus perhaps not surprising that the species ranks are higher compared to the training sets composed of transition metals. This may also explain why those datasets need a higher body-order: angles between bonds become more important.
To design more efficient networks for universal potentials, one strategy could be to first group similar species, contract them, and then take the contraction between different groups. Another interesting direction could be establishing metrics that correlate element-specific or bond parameters, like the average number of d-valence electrons or bandwidths, with the compression rates. Such correlations could then be used to guide setting up a good initial ETN structure that would not require too much optimization.

Comparison with non-polynomial MLIPs
Of the two other popular classes of MLIPs, neural network potentials and Gaussian process regression-based potentials, neural networks achieve state-of-the-art performance for molecule datasets. However, we have found in Section 6.2.2 that converging ETN potentials on molecules with the Python BFGS solver is difficult, in particular for ETN potentials with higher body-order. While we have shown in Section 4 that alternating least-squares solvers are in principle stable for ETNs, they still need to be implemented, and we thus postpone a comparison for molecules to future work. We remark, however, that neural networks have already been frequently benchmarked against MTPs. On unary systems (Li, Mo, Cu, Ni, Si, Ge), MTPs have been shown to reach higher accuracies than neural networks for a comparable number of parameters [15]. For the molecule database QM9, one of the most accurate implementations, HIP-NN [46], requires approximately an order of magnitude more parameters than MTPs to reach the same mean absolute error of about 18 meV. The presently best neural network model for QM9 according to Papers with Code, TensorNet [47], has reached a mean absolute error of about 4 meV. However, the network architecture in [47] requires on the order of one million parameters; this clearly emphasizes the need for more efficient representations like ETNs.
In the case of alloys, the presently most interesting dataset is, in our view, MoNbTaVW-HEA, because it contains configurations with lattice defects and liquid configurations, all of which are relevant for predicting mechanical properties. For this dataset, several Gaussian approximation potentials (GAPs) have been proposed in recent works by Darby, Kermode, and Csányi [17] and Darby, Kovács, Batatia, et al. [18], which also introduce compression schemes comparable to our tensor network approach. The GAP in [17] (in the following cGAP) uses a compression scheme that exploits symmetries in the power spectrum; the GAP in [18] (in the following trGAP) introduces a tensor-reduced parameter tensor using the canonical polyadic decomposition.
In order to compare those GAPs with ETN potentials, we have picked the three ETN potentials shown in Table 2 that were found during the optimization (Figure 8). All potentials use four-body interactions (d = 3) but different numbers of radial basis functions, angular momenta, and ranks. As in [17], [18], we have taken the reduced subset of 2329 configurations from MoNbTaVW-HEA for training, and benchmark the potentials on the test set from [44]. Moreover, we now use 5000 BFGS iterations; all other parameters remain the same.
We then compare these three ETN potentials with the best (non-compressed) GAP, cGAP, and trGAP, all of which use three-body interactions. The GAP and cGAP use 12 radial basis functions and a maximum angular momentum of 9. The trGAP uses 8 radial basis functions and a maximum angular momentum of 4, as does the ETN3 potential. Hence, the complexity of contracting radial basis functions and spherical harmonics is similar or lower for the ETN potentials.
Interestingly, an ETN potential with a maximum angular momentum of only 1 (ETN2) achieves an accuracy comparable to that of the three GAPs (Table 3). The energy error of the best ETN potential (ETN3) is the same as for the best GAP (trGAP), while the force error is lower, which might come from the higher body-order. However, we remark that increasing the body-order in GAPs is computationally expensive, while ETNs can achieve a linear scaling (at least theoretically). In addition, a GAP with higher body-order likely requires more sparse points/training data to achieve the same accuracy.
Our findings and those of Darby et al. thus indicate that, with a suitable compression of the feature space, MLIPs with higher body-order may require less training data than without any compression. Our optimization algorithm provides an efficient means of finding a suitable compression, something that would have been difficult to achieve with expensive grid searches.

Table 3: Comparison of the ETN potentials from Table 2 with the best GAPs reported in [17], [18] (the values are deduced from Figure 8 in [17], and from Figures 3 and 8 in [18]). The ETN test errors are the average errors over five training runs with different parameter initializations.

Generalizations to Other Tensor Networks
We think of the above construction as the simplest one realizing our main contribution: a way of incorporating equivariance into tensor networks. As hinted in Section 4, many ideas developed for conventional tensor networks are transferable to the equivariant ones. Below we list some of the ideas that we find can be immediately applied to equivariant tensor networks.

Arbitrary topology of tensor network cores
The method of reshaping two adjacent order-3 tensors, as illustrated in Figure 4, which also trivially works with one or two order-2 tensors, can be applied to construct an arbitrary topology involving order-3 tensors. For instance, the hierarchical Tucker format or PEPs, as sketched in Section A, can readily be realized with our construction.
To implement PEPs or other tensor formats with tensors of order higher than three, one either has to use higher-order 3-j (i.e., "4-j", "5-j", etc.) tensors along with learnable coefficients, or split the large order-5 tensor (such as the features in two-dimensional PEPs) into three order-3 tensors, which can then be symmetrized to avoid bias between the x and y coordinates.

Weight sharing
The ETNs as introduced in Section 4 depend linearly on the learnable weights in each of the tensors; however, already in our numerical experiments in Section 6.2 we used weight sharing for the tensors that compress the input features. This resulted in a more parameter-efficient model; however, some of the (multi-)linear algebra algorithms (like DMRG or ALS) then no longer apply.

Richer atomic features
The flexible nature of equivariant tensor networks allows one to equivariantly incorporate further atomic features like magnetic moments, charges, dipole moments, etc. [48], [49], and at the same time compress them in an efficient and physically interpretable manner.
For instance, two possible ways of treating magnetic moments (m_i on the central atom and m_j on a neighboring atom) lead to the feature vectors v(1)_ij and v(2)_ij.
Here, in v(1)_ij, we mix atomic species with magnetic moments into combined features using the tensors A1 and A2. Physically, we would say that we construct features that depend on the number (z) and magnetic state (m) of the core electrons of the i-th and j-th atoms. From a physical point of view it may also be reasonable for them to share weights, i.e., to fix A1 = A2. These combined features are then mixed together to produce information on how the (z, m) pair of the central atom should interact with the (z, m) pair of a neighboring atom, which is then contracted with the relative positions r_ij of these atoms to form the final feature vector of the interaction of the i-th and j-th atoms. A possible computational advantage of this method is that combining the m and z features with the A2 tensor can be precomputed and reused while processing different neighborhoods.
In v(2), instead, z_i and z_j are contracted using the learnable Z tensor; this operation is essentially a look-up of the (z_i, z_j)-indexed feature vector, which is then contracted with a similarly formed feature vector obtained from m_i, m_j, and r_ij. Which of these two (or many other possible) variants is better in terms of accuracy or efficiency is yet to be tested.
Another example is adding non-atomic features, like the electronic temperature T_el, to the potential [50]- [52]. A way to do this could be adding such a feature to all the feature vectors F or, instead, making it a separate model input contributing an electronic-temperature term to the energy.

Message passing/graph convolutions, and pooling operations
To introduce message passing (also known as graph convolution operations) [7], [53], [54], we recall that each of the input features F_i,(ℓmn) also depends on the particular atom i, which forms another tensor dimension. Instead of just summing all the features over the neighborhood of the i-th atom, we explicitly introduce the convolution operator K, where k is some smooth cut-off function that serves as the convolution kernel. The operation acts on the features F_i,(ℓmn) and distributes them over the neighboring atoms. Then, for example, a model in which the features F are passed to the neighbors, convolved with F, and the result again passed to the neighbors and convolved with F, is given by B K A(KF)F F and can be expressed diagrammatically. Again, operations on adjacent kernels, like A and B in this example, are possible as they are independent of the convolution operations K. The upper "hanging" link indicates that the result is the per-atom energy, which should then be summed over this dimension (i.e., over i).
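A minimal non-equivariant sketch of the convolution operator K: per-atom feature vectors are distributed to neighbors with a smooth distance-dependent kernel, and stacking two applications corresponds to two message-passing layers. The kernel form and all sizes are hypothetical.

```python
import numpy as np

def cutoff_kernel(r, r_cut=5.0):
    """Hypothetical smooth kernel vanishing at the cutoff radius."""
    return np.where(r < r_cut, (1.0 - r / r_cut) ** 2, 0.0)

def convolve(positions, F, r_cut=5.0):
    """(K F)_i = sum_{j != i} k(|r_i - r_j|) F_j: one message-passing step."""
    n = len(positions)
    out = np.zeros_like(F)
    for i in range(n):
        for j in range(n):
            if i != j:
                r = np.linalg.norm(positions[i] - positions[j])
                out[i] += cutoff_kernel(r, r_cut) * F[j]
    return out

rng = np.random.default_rng(8)
pos = rng.uniform(0.0, 4.0, size=(5, 3))
F = rng.standard_normal((5, 7))   # per-atom feature vectors

KF = convolve(pos, F)
# A second application stacks another message-passing layer: K(K F).
KKF = convolve(pos, KF)
assert KF.shape == F.shape and KKF.shape == F.shape
```

In the equivariant setting K acts only on the atom index i and therefore commutes with the tensor contractions A and B, as noted above.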
Pooling operations simply require defining an averaging operator over the atoms i. This is generally considered not useful for interatomic potentials, because it destroys the locality (short-sightedness) principle, but it may be good for cheminformatics models like the prediction of the band gap. For example, one can build a model that takes the atomic features F, passes them over to the neighbors, contracts them with the original features F using the tensor A, averages over the atoms, and contracts these averaged features with the average of the original features F (in the diagrams, av denotes averaging over the atoms). If one wants to use this block to inform a regular interatomic-potential-like model (leading to something resembling the attention mechanism), one could design something like
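A toy sketch of the pooling operator: averaging over the atom dimension yields a global feature vector, and any readout built from it is automatically invariant under atom permutations, which is what makes it suitable for global properties like the band gap. The readout matrix A below is a hypothetical learnable tensor.

```python
import numpy as np

rng = np.random.default_rng(9)
F = rng.standard_normal((6, 4))   # per-atom features (6 atoms, 4 channels)

# Global pooling: average over the atom dimension.
pooled = F.mean(axis=0)

# A toy global readout (e.g. for a band-gap-like scalar): contract the
# pooled features with a learnable matrix A.
A = rng.standard_normal((4, 4))
y = pooled @ A @ pooled

# Pooling makes the prediction invariant under atom permutations.
perm = rng.permutation(6)
pooled_p = F[perm].mean(axis=0)
assert np.allclose(pooled_p @ A @ pooled_p, y)
```

The price of this invariance is the loss of locality noted above, which is why pooling is reserved for global-property models rather than potentials.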

where in the diagram, for conciseness, we did not illustrate the part that compresses the radial, species, etc., information into a single feature vector, as is done with the tensors A, B, and C in (44).

Concluding Remarks
In this work, we have developed a formalism for constructing equivariant tensor networks (ETNs), i.e., tensor networks that remain invariant under group actions of SO(3). Our main finding is that, analogous to conventional tensor networks, we only require equivariant tensors of order up to three to construct arbitrary ETNs. In our view, most importantly, our formalism for constructing ETNs can therefore lead to a drastic simplification of constructing machine-learning interatomic potentials of arbitrary complexity, which is our main target application. Our confidence is grounded in the fact that many algorithms developed for conventional tensor networks translate almost verbatim to the equivariant case, e.g., automatic differentiation or tensor orthogonalization (cf. Section 4). So, once these algorithms are implemented, the only thing left is to code the tensor network diagram (which might even become possible using visual programming environments at some point); all the rest would literally "fall from the tree".
Representing interatomic interaction energies using ETNs can be considered a high-performance representation of polynomial machine-learning interatomic potentials that naturally allows for basis and feature compression. The numerical results presented in Section 6 indicate that ETNs can reach excellent performance for complex systems. Our ETN potential outperformed moment tensor potentials on several databases for multicomponent systems in terms of the required number of parameters by a factor of about 2 to 3. Interestingly, we have observed that the optimal hyperparameters do not evolve comparably for different systems; for example, molecules benefit from a higher body-order, while metallic alloys benefit from a higher polynomial degree. Future research may thus be devoted to developing improved algorithms that find the best ETN hyperparameters for a given problem. The fact that we could significantly compress the feature space for an ETN potential with a still rather low number of features (positional and species) gives us hope to reach even better compression rates when adding new features, e.g., magnetic moments. We are planning to explore this in future work, alongside extending ETNs in the many other directions outlined in Section 7.
Also, unlike linearly parameterized potentials like MTP or ACE, in which the construction of different basis functions requires different operations, ETN potentials are based on tensorial contractions, which are easier to parallelize on massively parallel architectures such as GPUs; this could become another advantage of ETN models.
Finally, we would like to emphasize that ETNs are not limited to interatomic potentials but can generally be used to represent symmetry-preserving multivariate functions. As such, they could also be applied to continuous problems such as differential equations, where they would share the advantage of conventional tensor networks over neural networks of avoiding a large sampling space (e.g., [55]).

Acknowledgments
M.H. gratefully acknowledges the financial support under the scope of the COMET program within the K2 Center "Integrated Computational Material, Process and Product Engineering (IC-MPPE)" (Project No. 886385). This program is supported by the Austrian Federal Ministries for Climate Action, Environment, Energy, Mobility, Innovation and Technology (BMK) and for Labour and Economy (BMAW), represented by the Austrian Research Promotion Agency (FFG), and the federal states of Styria, Upper Austria and Tyrol. A.S. was supported by the Russian Science Foundation (Grant No. 23-13-00332).
To simplify the latter expression, we use the symmetry property of the Wigner 3-j symbol (cf. [32]) and the relations involving the sign function. In the resulting expression, the + and − in the second parenthesis hold for even and odd ℓ1 + ℓ2 + ℓ3, respectively. We omit the derivations for m1, m2 = 0, as they follow trivially from the previous one, and only state the final results for m2 = 0 and for m1 = 0. For case (iii), using (B2) in (10) together with (B4) and (B5), we obtain an analogous expression, where, again, the + and − in the second parenthesis hold for even and odd ℓ1 + ℓ2 + ℓ3, respectively. From (B1), (B7)-(B9), and (B11), it follows that the 3-j symbol for RSHs obeys the selection rules [SR3] and [SR4], in addition to the triangular inequality [SR1] for the three ℓ's.
Further, for the 3-j symbol for RSHs to be a real number ≠ 0, one of the mi's must be positive, which follows from the definition of the entries of U^T* in (B2). Vice versa, for it to be a purely imaginary number ≠ 0, one of the mi's must be negative. Therefore, the 3-j symbol for RSHs is always real whenever ℓ1 + ℓ2 + ℓ3 is even, and purely imaginary whenever ℓ1 + ℓ2 + ℓ3 is odd.

B.2 Sparsity
To analyze the sparsity, i.e., the number of nonzero entries divided by the size of the tensor, we consider 3-j symbols with some fixed maximum angular momentum. The sparsity of the 3-j symbols for real and complex spherical harmonics is shown in Figure 11(a).
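For readers who want to reproduce such a sparsity count for the ordinary (complex-basis) 3-j symbol, here is a small sketch using SymPy; the function name and the ℓ-range convention are illustrative choices, not the paper's code.

```python
# Sketch: fraction of nonzero entries of the ordinary Wigner 3-j tensor
# for all (ell, m) with ell <= l_max.
from sympy.physics.wigner import wigner_3j

def wigner_sparsity(l_max):
    dim = sum(2 * l + 1 for l in range(l_max + 1))
    total = dim ** 3
    nonzero = 0
    for l1 in range(l_max + 1):
        for l2 in range(l_max + 1):
            # triangle rule: entries outside |l1 - l2| <= l3 <= l1 + l2 vanish
            for l3 in range(abs(l1 - l2), min(l_max, l1 + l2) + 1):
                for m1 in range(-l1, l1 + 1):
                    for m2 in range(-l2, l2 + 1):
                        m3 = -(m1 + m2)  # entries with m1 + m2 + m3 != 0 vanish
                        if abs(m3) <= l3 and wigner_3j(l1, l2, l3, m1, m2, m3) != 0:
                            nonzero += 1
    return nonzero / total

print(wigner_sparsity(2))  # fraction of nonzero entries for l_max = 2
```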
Interestingly, even though the 3-j symbol for RSHs is slightly less sparse, using an RSH basis requires slightly fewer floating point operations (FLOPs) for tensor contractions than using a complex spherical harmonics basis. To see this, let T be an equivariant tensor of size M × M × M and let its channels decay with increasing ℓ according to eq. (45). The maximum number of channels is denoted by N = N(ℓ = 0).
We exemplify our analysis by means of the contraction u_i = T_ijk v_j w_k, which is the most relevant operation for contracting our ETNs; here u_i and w_i are two covariant but otherwise arbitrary vectors, and v_j is the input feature vector. First, suppose that we are using a complex spherical harmonics basis. Then T_ijk is a real-valued tensor with N^csh_nz nonzero entries, and the feature vector v_j is complex-valued. The contraction can thus be split into the operations (B12). Now assume a real basis. Then T_ijk is a complex-valued tensor with N^rsh_nz nonzero entries, and the feature vector v_j is real-valued. The first two operations from (B12) therefore change accordingly, with T split into its real and imaginary parts.
Figure 11: (a) Sparsity of the 3-j symbols for real and complex spherical harmonics, and the order-2 tensor S from (B12). (b) Ratio of required FLOPs for the contraction u_i = T_ijk v_j w_k for different maximum numbers of channels N when using a complex and a real spherical harmonics basis, respectively, with the corresponding FLOPs given in parentheses; we assume fused multiply-add instructions and that the intermediate tensor S_ik is stored in dense format, as it is not very sparse (cf. Figure 11(a)).
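The two-step structure underlying this FLOP count, first S_ik = T_ijk v_j and then u_i = S_ik w_k, can be checked numerically; the following NumPy sketch uses illustrative shapes and is not the paper's implementation.

```python
# Sanity check: the contraction u_i = T_ijk v_j w_k factors through the
# intermediate order-2 tensor S_ik = T_ijk v_j.
import numpy as np

rng = np.random.default_rng(1)
M = 6
T = rng.standard_normal((M, M, M))                        # real-valued core
v = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # complex features
w = rng.standard_normal(M) + 1j * rng.standard_normal(M)

u_direct = np.einsum('ijk,j,k->i', T, v, w)
S = np.einsum('ijk,j->ik', T, v)   # intermediate order-2 tensor
u_two_step = S @ w
assert np.allclose(u_direct, u_two_step)
```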

B.3 Contraction Properties
Generating a new covariant vector by contracting the usual 3-j symbol with two covariant vectors u_{ℓ1,m1} and u_{ℓ2,m2} yields another covariant vector (−1)^{m3} w_{ℓ3,−m3} with a phase shift. To see this, consider the 3-j symbol in terms of the Clebsch-Gordan coefficients C and recall that the Clebsch-Gordan coefficients expand the basis |ℓ3 m3⟩ in terms of |ℓ1 m1⟩ and |ℓ2 m2⟩. This phase shift is naturally reversed when using the 3-j symbol for RSHs. To understand why this is the case, consider the contraction

B.4 Symmetry Properties
The 3-j symbols for RSHs obey the same symmetry properties under even and odd permutations as the usual 3-j symbol, i.e., they are invariant under even permutations and acquire a factor (−1)^{ℓ1+ℓ2+ℓ3} under odd permutations. On the other hand, changing the sign of the m's gives, under the assumption that [SR3] holds,
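The permutation property of the usual 3-j symbol can be verified directly with SymPy (the RSH variant defined in this appendix is not implemented here; the chosen ℓ's and m's are an arbitrary example):

```python
# Check: an odd permutation of the columns multiplies the usual 3-j symbol
# by (-1)^(l1 + l2 + l3).
from sympy import Integer, simplify
from sympy.physics.wigner import wigner_3j

l1, l2, l3 = 1, 1, 1   # l1 + l2 + l3 = 3 is odd, so the phase is -1
m1, m2, m3 = 1, -1, 0
swapped = wigner_3j(l2, l1, l3, m2, m1, m3)   # odd permutation of columns
phase = Integer(-1) ** (l1 + l2 + l3)
assert simplify(swapped - phase * wigner_3j(l1, l2, l3, m1, m2, m3)) == 0
```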

B.5 Orthogonality Properties
Recall that the usual 3-j symbol has the following orthogonality property [32]:
(2ℓ + 1) Σ_{m1,m2} (ℓ1 ℓ2 ℓ; m1 m2 m)(ℓ1 ℓ2 ℓ′; m1 m2 m′) = δ_{ℓℓ′} δ_{mm′}.

C.1 Moment Tensor Potentials
Consider the vector F, which includes all M_{µ⟨ν⟩}'s for a given MTP level reshaped to a vector; we have chosen its index to be (l̃ m̃) to highlight the similarity with the spherical harmonics basis. The per-atom energy can then be written as
E({r_ij}) = Σ_α θ_α B^α_{(µ1 l̃1 m̃1)...(µd l̃d m̃d)} v_{(µ1 l̃1 m̃1)} ... v_{(µd l̃d m̃d)},  (C4)
with
v_{(µi l̃i m̃i)} = Σ_j C_{µi ni βi γi} F_{ni βi γi (l̃i m̃i)}.  (C5)
Here, B^• is an integer tensor that encodes all O(3)-invariant contractions of the features for a given level. MTPs can then be written in tensor network diagram notation. The diagram should be understood as follows: unlike using the 3-j symbol (or, for that matter, the Clebsch-Gordan tensor) multiple times to obtain polynomials of order higher than three, MTPs use an alternative way of contracting the moments to produce a complete set of basis functions.
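The linear structure of (C4) can be sketched for a toy case with body order d = 2; all names, shapes, and the random integer tensor B below are illustrative stand-ins, not the actual MTP basis.

```python
# Toy sketch of the linear form (C4) for d = 2: the per-atom energy is linear
# in the coefficients theta, with basis functions formed by contracting an
# integer tensor B with copies of the feature vector v.
import numpy as np

rng = np.random.default_rng(2)
n_alpha, n_feat = 5, 7
B = rng.integers(-1, 2, size=(n_alpha, n_feat, n_feat))  # encodes contractions
theta = rng.standard_normal(n_alpha)
v = rng.standard_normal(n_feat)

E = np.einsum('a,ajk,j,k->', theta, B, v, v)
# linearity in theta: doubling the coefficients doubles the energy
assert np.isclose(np.einsum('a,ajk,j,k->', 2 * theta, B, v, v), 2 * E)
```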

C.2 Spectral Neighbor Analysis and Atomic Cluster Expansion Potentials
The Spectral Neighbor Analysis (SNAP) potentials [8] are four-body potentials (i.e., they multiply the features F with each other three times) and have the following diagram. Here we simplify the diagram by not explicitly distinguishing the summation-over-neighbors operations and by not separating radial from chemical features (thus not distinguishing the original SNAP from explicit multi-element SNAP [57]). These potentials use Clebsch-Gordan coefficients to ensure rotational symmetry and have an order-3, (ℓn)-parametrized tensor θ of learnable coefficients. If we replace the Clebsch-Gordan coefficients with the 3-j coefficients as introduced in this paper, then SNAP can be viewed as simply contracting the three feature vectors with a single equivariant order-3 tensor, as introduced in Figure 3(a). The Atomic Cluster Expansion (ACE) potentials [10] extend the construction (C7) to a chain of Clebsch-Gordan tensors and a large-order tensor of coefficients θ. Compared to MTP, ACE uses a procedure based on the Clebsch-Gordan coefficients to produce a (likewise complete) set of basis functions that is indexed by (ℓi, mi) and whose coefficients form a tensor. In a very recent work, a canonical tensor decomposition was applied to ACE to compress the coefficient tensor [18]. Thus, comparing ACE and ETN to SNAP (all three potentials use spherical harmonics), one can say that ACE generalizes SNAP by chaining the Clebsch-Gordan tensors and forming a large coefficient tensor while preserving the linearity of the model, whereas ETN generalizes SNAP by chaining order-3 equivariant tensors, moving from the linear SNAP to a multilinear model.
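The contrast drawn here, SNAP as a single order-3 coefficient tensor versus ETN as a chain of order-3 cores, can be sketched as follows (shapes and the rank r are illustrative, and the equivariance bookkeeping is omitted):

```python
# Schematic contrast: SNAP-style energy is linear in one order-3 coefficient
# tensor; an ETN-style chain of order-3 cores is multilinear in the cores.
import numpy as np

rng = np.random.default_rng(3)
n = 5
F = rng.standard_normal(n)   # feature vector

theta = rng.standard_normal((n, n, n))   # SNAP-like: linear in theta
E_snap = np.einsum('ijk,i,j,k->', theta, F, F, F)

r = 3                                    # ETN-like: chained rank-r cores
T1 = rng.standard_normal((n, n, r))
T2 = rng.standard_normal((r, n))
E_etn = np.einsum('ijr,rk,i,j,k->', T1, T2, F, F, F)

# the chained cores implicitly define a (low-rank) full coefficient tensor
theta_eff = np.einsum('ijr,rk->ijk', T1, T2)
assert np.isclose(E_etn, np.einsum('ijk,i,j,k->', theta_eff, F, F, F))
```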

Figure 1 :
Figure 1: Example of diagrams of basic tensors and operations on them.

Figure 2 :
Figure 2: Diagrammatic illustration of concepts in and operations on order-2 tensors.
The 3-j symbol can be interpreted as reducing the double input into a single one, allowing the coefficient tensor to be reshaped into an order-2 matrix

Figure 3 :
Figure 3: Diagrammatic illustration of concepts in and operations on order-3 tensors.

Figure 4 :
Figure 4: Diagrammatic illustration of how one can conduct QR, SVD, TT-DMRG-cross, or other algorithms on two adjacent order-3 equivariant tensors. They are reduced (reshaped) to the product of two order-2 tensors, operations on which (e.g., matrix multiplication) can be done block-wise with respect to ℓ.

Figure 5 :
Figure 5: (a) Mean validation loss, and (b) RMSE, as a function of the number of parameters for QM7. The horizontal bars correspond to the minimum and maximum values.

Figure 6:
Figure 6: (a) Mean validation loss, and (b) RMSE, as a function of the number of parameters for QM9. The horizontal bars correspond to the minimum and maximum values.
Figure 7:

Figure 8:
Figure 8: (a) Mean validation loss, and (b) RMSE/RMSF, as a function of the number of parameters for MoNbTaVW-HEA. The horizontal bars correspond to the minimum and maximum values.
Figure 9:

Figure 10 :
Figure 10: Evolution of the ETN hyperparameters over the iterations of the optimization algorithm for the five datasets from Figures 5-9.

The operations from (B12) read
S^re_ik = T_ijk v^re_j (N^csh_nz FLOPs),
S^im_ik = T_ijk v^im_j (N^csh_nz FLOPs),
u^re_i = S^re_ik w^re_k + S^im_ik w^im_k (2M² FLOPs),
u^im_i = S^re_ik w^im_k + S^im_ik w^re_k (2M² FLOPs),
while in the real basis the first two operations become
S^re_ik = T^re_ijk v_j (N^{rsh/re}_nz FLOPs),
S^im_ik = T^im_ijk v_j (N^{rsh/im}_nz FLOPs),
where N^{rsh/re}_nz and N^{rsh/im}_nz denote the numbers of nonzeros of the real and imaginary parts of T, respectively. Since an entry of T is either purely real or purely imaginary, we have N^rsh_nz = N^{rsh/re}_nz + N^{rsh/im}_nz. Hence, using RSHs requires fewer FLOPs, since N^rsh_nz + 4M² < 2N^csh_nz + 4M² for all tensor sizes shown in Figure 11(b).

Table 2 :
Hyperparameters N_rad, d, L, N^rank_spec, N^rank_rad, N^rank_ett, and number of coefficients of the three ETN potentials used for the comparison in Section 6.3, where N_rad is the maximum number of radial basis functions, d is the order of the ETT, L is the maximum number of angular momenta, and N^rank_spec, N^rank_rad, and N^rank_ett are the ranks of A, B, and T, respectively

Table 3 :
MoNbTaVW-HEA test errors for the three ETN potentials from Table 2