TAMM: Tensor Algebra for Many-body Methods

Tensor contraction operations in computational chemistry consume significant fractions of computing time on large-scale computing platforms. The widespread use of tensor contractions between large multi-dimensional tensors in describing electronic structure theory has motivated the development of multiple tensor algebra frameworks targeting heterogeneous computing platforms. In this paper, we present Tensor Algebra for Many-body Methods (TAMM), a framework for productive and performance-portable development of scalable computational chemistry methods. The TAMM framework decouples the specification of the computation and the execution of these operations on available high-performance computing systems. With this design choice, the scientific application developers (domain scientists) can focus on the algorithmic requirements using the tensor algebra interface provided by TAMM whereas high-performance computing developers can focus on various optimizations on the underlying constructs such as efficient data distribution, optimized scheduling algorithms, efficient use of intra-node resources (e.g., GPUs). The modular structure of TAMM allows it to be extended to support different hardware architectures and incorporate new algorithmic advances. We describe the TAMM framework and our approach to sustainable development of tensor contraction-based methods in computational chemistry applications. We present case studies that highlight the ease of use as well as the performance and productivity gains compared to other implementations.

Tensor algebra operations such as contractions in computational chemistry consume a significant fraction of the computing time on large-scale computing platforms. The widespread use of tensor contractions between large multidimensional tensors in describing electronic structure theory has motivated the development of multiple tensor algebra frameworks targeting heterogeneous computing platforms. In this paper, we present Tensor Algebra for Many-body Methods (TAMM), a framework for productive and performance-portable development of scalable computational chemistry methods. TAMM decouples the specification of the computation from the execution of these operations on available high-performance computing systems. With this design choice, the scientific application developers (domain scientists) can focus on the algorithmic requirements using the tensor algebra interface provided by TAMM, whereas high-performance computing developers can direct their attention to various optimizations on the underlying constructs, such as efficient data distribution, optimized scheduling algorithms, and efficient use of intra-node resources (e.g., graphics processing units). The modular structure of TAMM allows it to support different hardware architectures and incorporate new algorithmic advances. We describe the TAMM framework and our approach to the sustainable development of scalable ground-and excited-state electronic structure methods. We present case studies highlighting the ease of use, including the performance and productivity gains compared to other frameworks.

I. INTRODUCTION
Enabling highly-scalable computational environments that abstract and automate the development of complex tensor algebra operations is critical in order to advance computational chemistry towards more complex and accurate formulations capable of taking advantage of emerging exascale computing architectures and also serve as a foundation for a sustainable and portable electronic structure software stack. In particular, tensor contractions (TCs) are a universal language used in many areas of quantum mechanics to encode equations describing collective phenomena in many-body quantum systems encountered in quantum field theory, quantum hydrodynamics, nuclear structure theory, material sciences, and quantum chemistry. Typically, contractions between multi-dimensional tensors stem from the discretization procedures used to represent the Schrödinger equation in a finite-dimensional algebraic form. An important area where tensor contractions play a crucial role is electronic structure theory, where tensors describe basic interactions and parameters defining wave function expansions.
One of the most critical applications of TCs is the coupleda) Electronic mail: erdal.mutlu@pnnl.gov b) Electronic mail: ajay.panyala@pnnl.gov c) Electronic mail: sriram.krishnamoorthy@gmail.com cluster (CC) formalism. [1][2][3][4][5][6][7][8] In the CC theory, the form of the complex TCs used to represent non-linear equations describing the correlated behavior of electrons in molecular systems also reflects the fundamental feature of the CC formalism referred to as the size-extensivity or proper scaling of the energy with the number of particles. For this reason, the CC formalism is one of the most accurate computational models used these days in computational chemistry. The CC theory has assumed a central role in high-accuracy computational chemistry and still attracts much attention in theoretical developments and numerical implementations. In this effort, high-performance computing (HPC) and the possibility of performing TCs in parallel play an essential role in addressing steep polynomial scaling as a function of system size and extending CC formalism to realistic chemical problems described by thousands of basis set functions. To understand the scale of the problem, canonical CC formulations such as the ubiquitous CCSD(T) approach 9 (CC with iterative single and double excitations with non-iterative corrections due to triple excitations) involve contractions between four-dimensional tensors where ranges of thousands of entries can define each dimension. In order to alleviate efficient CC calculations on parallel platforms, several specialized tensor libraries have been developed over the last decade.
In the last few decades, significant effort has been expended toward enabling CC simulations for very large chemical systems to extend the applicability of the CC formalism further. In the reduced-scaling formulations, commonly referred to as the local CC methods, [10][11][12][13][14][15][16][17][18] one takes advantage of the local character of correlation effects to effectively reduce the number of parameters and the overall cost of CC iterations, allowing for calculations of systems described by 10,000-40,000 basis set functions. This paper discusses a new tensor library, Tensor Algebra for Many-body Methods (TAMM), which provides a flexible environment for expressing and implementing TCs for canonical and local CC approximations. The development of functionalities for the ground-state formulations []canonical CCSD and CCSD(T) methods and local CC formulations] using TAMM is supported by the NWChemEx project, 19 whereas excited-state formulations [CC Green's function (CCGF), Equation-of-Motion (EOM) CC, and Timedependent CC (TDCC)] are supported by the SPEC project. 20 As HPC systems continue to evolve to include different types of accelerators, diverse memory hierarchy configurations, and varying intra-and inter-node topologies, there is a need to decouple the development of electronic structure methods from their optimizations for various platforms. TAMM enables the compact specification of various electronic structure methods that heavily rely on tensor operations while allowing independent yet incremental development of optimization strategies to map these methods to current and emerging heterogeneous platforms.
The rest of the paper is organized as follows: Section II briefly discusses other tensor algebra frameworks used for developing computational chemistry applications. Section III describes the details of our tensor algebra interface and the underlying constructs that are used to efficiently distribute tensor data and schedule and execute tensor operations on modern HPC platforms. Later in III C, we provide a feature comparison with other distributed tensor algebra frameworks. Section IV showcases multiple CC methods implemented using TAMM and performance results obtained using the OLCF Summit Supercomputer. 21 Section V demonstrates the performance of TAMM in comparison to other distributed tensor algebra frameworks.

II. REVIEW OF EXISTING INFRASTRUCTURE
Tensor-based scientific applications have been widely used in different domains, from scientific simulation to more recent machine learning(ML)-based scientific applications. Over the years, program synthesis and code generation have become the go-to solution for such applications. The Tensor Contraction Engine (TCE), 22 which is used in the NWChem computational chemistry software package, 23 has been one such successful solution for automatically generating parallel code for various molecular orbital (MO) basis CC methods in Fortran 77 . In later work, the TCE project 24 added support for optimizations on tensor expression factorization, optimized code generation for various hardware, and space time trade-offs in order to improve and also implement more complex CC methods.
In a separate effort, the FLAME project 25 provided formal description support for describing linear algebra operations over matrices with support for the optimized implementation of these kernels for distributed memory systems. Later, various studies over-optimizing tensor algebra [26][27][28] have been proposed using the FLAME framework.
The Cyclops Tensor Framework (CTF) 29 was developed aiming for a more efficient kernel implementation for tensor operations using concurrency. The framework focused on reducing the required communication in parallel executions of CC-based methods by using a dense triangular tensor representation. Solomonik and Hoefler extended CTF for general sparse computations. 30 Manzer et al. demonstrated the benefits of exploiting block sparsity in electronic structure calculations. 31 Neither approach includes notational support for the general block-structured nature of the sparsity that naturally occurs in electronic structure methods.
Epifanovsky et al. 32 developed an object-oriented C++ library called libtensor for efficient execution of post-Hartree Fock electronic structure methods using a blocked representation of large-size dense tensors. In later work, 33 they optimized various operations using efficient memory management techniques that are thread-friendly and NUMA-aware. Unlike TAMM, libtensor does not offer a distributed infrastructure for tensor algebra operations and is restricted to shared memory systems.
Orca is a general quantum chemistry program involving implementations of reduced-scaling methods. In order to achieve the speed up, various approximations are employed, for example, density fitting, RIJ-COSX, or a local approach for NEVPT2 or CC methods. The C++ code employs message passing interface (MPI)-based parallelization schemes, and recently, in a pilot study, they implemented a scheme for generating 3-and 4-index integrals via accelerators. 34 Orca provides a wide range of computational chemistry methods, whereas TAMM provides a general distributed tensor algebra framework to enable the productive development of scalable computational chemistry applications.
The ExaTensor 35 library uses a domain-specific virtual processor concept to allow performance portable programming and execution of tensor operations on modern graphics processing unit architectures. While the library is mostly focused on general GPU computation and the portability of such operations to different systems, they demonstrated its effectiveness on numerical tensor algebra workloads that mainly focus on quantum many-body computations on HPC platforms. The main focus of the ExaTensor library is to utilize GPUs for efficient tensor contractions. TAMM, on the other hand, provides a large set of supported tensor operations and utility routines. TAMM also provides special constructs allowing users to add new operations on tensors that are needed for the development of complete computational chemistry applications.
The TiledArray framework 36 is actively being developed for scalable tensor operations that support various computational chemistry methods. 37 It makes use of the Multi resolution Adaptive Numerical Environment for Scientific Simulation (MADNESS) parallel runtime, which 38 that employs a highlevel software environment for increasing both programmer productivity and code scalability in massively parallel systems. TiledArray employs a hierarchical view of sparsity coupled with explicit user-written loop nests to perform specialized operations over sparse tensors. TiledArray has several similarities with TAMM while also differing in several key aspects, which are explained later in Section III C.
DISTAL 39 is another recently developed distributed tensor algebra compiler that allows expressing computation on tensors as tensor algebra expressions. DISTAL lets users describe how the data and computation of a tensor algebra expression map onto a target machine allowing separation of the specifications of data distribution and computation distributions. DISTAL lets users specialize computation to the way that data are already laid out or easily transform data between distributed layouts to match the computation. TAMM, on the other hand, offers generality by decomposing tensor algebra expressions into a series of distributed matrix multiplication and transposition operations. In a more recent work, the authors also introduced SpDISTAL 40 , which adds support for sparse data structures and allows users to define generic sparse tensors. SpDISTAL then generates code for executing operations over these tensors. Both DISTAL and TAMM have several similarities in the aspects of storage, computation, and scheduling. However, they also differ in certain key aspects, which are further elaborated in Section III C.

III. TAMM FRAMEWORK
This section provides a detailed explanation of our tensor algebra framework. Figure 1 shows the conceptual overview of our framework. TAMM provides a tensor algebra interface that allows users to describe tensor operations in a familiar mathematical notation while underneath it employs efficient data distribution/management schemes as well as efficient operation execution on both central processing unit (CPU)-based and GPU-based hardware. Our framework leverages efficient libraries such as HPTT 41 for tensor index permutations on CPUs, LibreTT 42 for tensor index permutations on GPUs, and Global Arrays, 43 Figure 2: Source code example for IndexSpace, TiledIndexSpace, and Tensor construction using TAMM.

A. Tensor algebra operations in TAMM
TAMM provides a flexible infrastructure for describing tensor objects using the general notion of an index space (simply an index range) that is used to describe the size of each dimension. TAMM also employs tiling capabilities, where users can define arbitrary or fixed size tiling over index spaces to construct tiled index spaces to represent a blocked representation of a tensor. This allows tensors to be constructed as a set of smaller blocks of data indexed by the Cartesian product of the number of tiles on each dimension. This notion of tiling enables efficient data distribution and efficient tensor operations (such as tensor contractions) that can better leverage the underlying hardware by utilizing the available cache system as well as the execution modules (i.e., GPUs and CPUs). TAMM's flexible tiling capabilities provide crucial support for load balancing and making effective trade-offs over data communication and efficient computation. These optimizations will be detailed in Subsection III B, where we describe the data communication and execution of tensor operations. Figure 2 shows an example of index space, tiled index space, and tensor construction in the TAMM framework. Lines 2-4 show the construction of IndexSpace objects that represent a range of indices using a range constructor. Our framework also allows users to describe application-specific annotations over the subsections of these spaces. Lines 4-7 show an example of string-based annotation over subsections of the construction range. This allows users to easily access different portions of an index space, thereby enabling access to slices of a tensor and constructing new tensors using slices of the original index space. Similarly, users can encode spin information related to each dimension over the input tensors, allowing our run-time to allocate these 1 // construct tN labels tensors using a block-sparse representation. For the time being, we only support spin symmetry for representing the block sparsity where the zero-blocks are not stored. Users can define their own non-zero check functionality to be used during allocation and tensor operation execution for storage and computational benefits. Additional capabilities, such as restricted summations or point group and permutational symmetries, are not supported in the tensor specification. Point group symmetry is less concerning since TAMM targets large systems that often do not have such symmetry. However, developers can explicitly code these features using TAMM to construct the desired tensors. This is the case for methods such as CCSD(T), discussed in Section IV A 2. TAMM also supports specialized tensor construction, called lambda tensor, where the user can provide a C++ lambda expression that specifies how a block of the tensor is computed, as shown in Line 19. This construct can be used to dynamically generate blocks of a tensor as needed. Note that these tensors are not stored in memory; they are read-only objects computed on the fly to be consumed in various tensor operations described later in this section. Similarly, TAMM also provides a special construct called view tensor to describe access to an existing tensor using C++ lambda expressions. The main use of view tensors is to define tensors of different shapes that can be used as a reference tensor for any operation as well as apply possible sparsity mapping/constraints on a dense tensor. Similar to lambda tensors, a view tensor does not allocate any additional storage but rather provides specialized access to the data available in the referenced tensor. We provide details of a use case for view tensors when discussing the DLPNO-CCSD method in Section IV B 2.
Lines 10-12 show the construction of TiledIndexSpace objects that represent a sliced index space that is used in tensor construction to have a blocked structure. Tiling can be applied as a fixed size (i.e., line 10) or as arbitrary tile sizes with full coverage of the index space (i.e., line 11). Finally, lines 15-17 in Figure 2 show the construction of Tensor objects using tiled index spaces. Each TiledIndexSpace object used in the tensor constructor represents a dimension in the constructed tensors. In this example, tensor A is a 30 × 20 matrix with eight blocks, as denoted by tM, which consists of two tiles of sizes 10 and 20, while tK consists of four tiles that are size 5. While these lines construct a tensor object, the tensor is collectively allocated by all participating compute nodes in a subsequent operation using a scheduler.
Another important concept in constructing tensor operations is index labels, which allow specifying tensor operations in familiar mathematical notations and provide slicing capabilities over tensors by using the string-based subsections of the full index space. Labels are associated with tiled index spaces and used in the tensor operation syntax to describe the computation. Depending on the index spaces that the tensors are constructed on, users can specify string-based sub-spaces to define operations on different slices of the underlying allocated tensor object. Figure 3 shows examples of TiledIndexLabel construction using TiledIndexSpace objects. While lines 1-6 construct the labels over the full spaces, line 8 shows the label creation for the first portion of the tK index space (see construction on Figure 2 line 4). These sub-spaces can then be used for specifying operations over sliced portions of the full tensor as long as the index labels are from a subset of the original index space.
TAMM supports several tensor operations: tensor set, tensor addition, subtraction, scale, trace, transpose, general tensor contractions, inner and outer products, and reduction. TAMM supports different tensor data types and several mathematically valid operations between tensors of different data types (e.g., tensor contraction between a complex and real tensor). Figure  4 gives the grammar for allowed tensor operations' syntax in the TAMM framework. Each tensor operation syntax rule (⟨tensor-op⟩) is composed of a left-hand side (lhs, ⟨op-lhs⟩) and a right-hand side (rhs, ⟨rhs⟩) operation. While lhs can only be a labeled tensor construct (⟨label-tensor⟩), rhs can be of different types that result in different tensor operations, as follows: • alpha value [A(i,l) = alpha], which corresponds to a tensor set operation that assigns the corresponding alpha value to all elements of the tensor.
• labeled tensors [A(i,l) += alpha * D(l,i)] correspond to a tensor addition operation (with respect to the label permutation on tensor D) in Eq. (7).
• contraction of two labeled tensors [C(i,a) += alpha * A(i,l) * B(l,a)] updates the lhs with the tensor contraction results in Eq. (8).
Several additional operations on tensors are provided as utility routines in TAMM. Element-wise operations such as square, log, inverse, etc. are provided. Additional useful operations on tensors, such as min, max, norm, etc., are also provided, including high-level routines to read from and write tensors to disk using parallel file I/O. All these operations are defined using a parallel block_for construct provided by TAMM, allowing users to easily define custom element-and block-wise operations on a tensor. The block_for construct allows userdefined operations (using C++ lambda expressions) that are executed in parallel on distributed tensors.
Similar to tensor allocation, all other tensor operations that use and modify the distributed tensor data have to be performed via a scheduler that has the necessary information about the execution context of the operation. Execution context includes required information about the distribution scheme of the tensor data and the execution medium for the operations through distribution, memory manager, and process group constructs. Note that the tensor operations that are provided as utility routines do not use the scheduler, but still require the execution context to perform the corresponding operation. TAMM employs a modular construction of these concepts to allow users to implement various distribution schemes as well as different distributed memory frameworks. Figure 5 shows the construction of a Scheduler object using the execution context and the execution of defined tensor operations. After creating a scheduler, users can directly queue tensor operations such as tensor allocation ( line 12), tensor set and add operations (lines 13 and 14), and tensor contractions (line 15). Finally, all queued operations are executed using the execution context on a distributed system. The tensor operations, as well as the operations over the index spaces, are formally described in our previous work. 44 The syntax for expressing operations shown in lines 13-15 also indicates the productivity benefits that can be obtained by using TAMM. The operations expressed in these three lines are executed in parallel on CPUs, GPUs, or both on any distributed computing platform. Manually writing parallel code for these operations would lead to a significant increase in the number of source lines of code (SLOC). Extending such manual development of parallel code to a real application with a large number of such operations would only lead to a significant increase (orders of magnitude) in the SLOC count and development time which would also make future improvements to such code infeasible.
This section summarizes the tensor algebra interface as an embedded domain-specific language (eDSL) in the TAMM framework. By implementing an eDSL, we were able to separate concerns for developing scientific applications. While the high-level tensor abstractions allowed domain scientists to implement their algorithms using a representation close to mathematical formulation, it also allowed framework developers to test various different optimization schemes on the underlying constructs (i.e., different data distribution schemes, operation execution on accelerators, use of different PGAS systems, etc.), which we will detail in the coming section.

B. Tensor distribution and operation execution in TAMM
TAMM leverages various state-of-the-art frameworks and libraries to achieve a scalable performance-portable implementation of tensor algebra operations on exascale supercomputing platforms through efficient data distribution and intra-node execution of tensor operation kernels on CPUs and GPUs. The default memory manager for tensor data distribution in TAMM is based on the Global Arrays framework, a Partitioned Global Address Space (PGAS) programming model that provides a shared memory-like programming model on distributed memory platforms. Global Arrays provides performance, scalability, and user productivity in TAMM by managing the inter-node memory and communication for tensors. A TAMM tensor is essentially a global array with a certain distribution scheme. We have implemented three tensor distribution schemes in TAMM. The first scheme computes an effective processor grid for a given number of processes. A dense tensor is then mapped onto the processor grid. The second scheme is a simple round-robin distribution that allocates equal-sized blocks in a round-robin fashion where the block size is determined by the size of the largest block in the tensor. This distribution over-allocates memory and ignores sparsity. The third scheme allocates the tensor blocks in a round-robin fashion while taking block sparsity into account. By only allocating non-zero blocks in the tensor, it minimizes the memory footprint of overall computation, allowing bigger-sized problems to be mapped to the available resources.
TAMM uses the "Single Program Multiple Data (SPMD)" model for distributed computation. In this programming abstraction, each node has its own portion of tensors available locally as well as access to the remote portions via communication over the network. As a result, operations on whole tensors can result in access of remote portions on the tensor, with implied communication. More importantly, many operations (i.e., tensor contractions, addition, etc.) are implied to be collective as they involve the distributed tensors as a whole. While the tensor algebra interface follows a sequential ordering of tensor operations, we also tried to conceal the burden of thinking in a distributed manner while writing a scientific application. To avoid possible issues with operations on distributed tensors, TAMM is designed to represent tensors in terms of handles and requires tensor operations to be declared explicitly and executed using a scheduler. Hence, any assignment performed on tensor objects will be a shallow copy as opposed to a deep copy, as a deep copy implies communication (message passing between nodes to perform the copy operation).
The computational chemistry methods are implemented as a sequence of operations over distributed tensors. Through this design, we were able to separate the specification of the operations from their implementations, allowing method developers to mainly focus on the computational chemistry algorithms while kernel developers can focus on the optimization of individual tensor operations. The execution of all tensor operations is managed by a scheduler. The TAMM scheduler employs a data flow analysis over the queued tensor operations to find the dependencies between each operation in order to load balance the computation and communication requirements of the overall execution. Using a levelized execution scheme, the scheduler is able to limit the global synchronizations to achieve load-balanced and communication-efficient schedules for the execution of all tensor operations. The dependency analysis over the high-level operations is based on a macro-operation graph. When two or more operations share the same tensor object and one of these operations updates the shared object, the operations are marked as conflicting operations that can not be executed in parallel. This operation-graph is used to construct a batch of operations that can be executed in parallel, minimizing the total number of global synchronizations required by the computation. The operations in these batches are then executed in an SPMD fashion. For instance, the canonical CCSD implementation in TAMM has 10 levels of operation batches that sum to over 125 tensor operations.
While TAMM hides the burden of choosing the bestperforming schedule from the users through load-balanced scheduler execution, it allows users to control various aspects of the computation, such as data distribution, parallelization strategies, operation ordering, and execution medium (i.e., CPUs, GPUs). To achieve this, TAMM uses a modular structure to describe constraints imposed by the users to automate the construction of an execution plan for efficient execution of the computation. This allows users to incrementally optimize their code with minimal programming effort while keeping it readable as opposed to a code generator-based solution. With these controls over the execution and distribution of the tensor data, users can choose from different optimizations at different granularities. For example, the user can increase the data locality by replicating frequently used small tensors on each distributed node or choosing from different distribution schemes for efficient tensor operation execution on various node topologies.Such an optimization can be implemented as a new data distribution scheme (i.e., SUMMA 45 ) by extending the Distribution class. By simply using this distribution, users can enforce a specific distribution on tensors which can optimize required data communication for specific operations. Another abstraction for optimization is the actual computation of the tensor operations. Users can define new operations by extending the Op class which can then be scheduled to be executed. While TAMM supports the most common operations defined on tensors (i.e., addition and contraction), it also implements a Scan and Map operation that can be used to define various element/block-wise operations using lambda functions. Additionally, as a lower-level abstraction, users can also decide to describe new executions specific to a new architecture/accelerator on the kernel level (i.e., DGEMM, GPU abstraction). While TAMM supports main GPUs (i.e., Nvidia, AMD, and Intel) that will be available in upcoming HPC systems, users can also choose to implement new kernel-level abstractions for different hardware stacks (i.e., FPGAs and ML/DL accelerators).
To achieve highly optimized execution of tensor operations on a given platform, TAMM is designed to allow the use of multiple external libraries that provides optimized kernels for various operations. In addition to leveraging vendor-provided linear algebra libraries that are highly tuned for both CPUs and GPUs, TAMM also uses the following libraries: the HPTT 41 library for optimized tensor index permutations on CPUs; Li-breTT 42 for enabling efficient index permutation on Nvidia, AMD, and Intel GPUs; the BLIS 46 library for efficient BLAS implementation in CPU; and finally, TensorGen 47 for generating optimized tensor contraction GPU kernels specialized for CC methods in computational chemistry.
While TAMM tries to hide the execution details from the end-user by employing high-level tensor algebra expressions, users can specify the medium on which they want the operations to be executed. Users can specify a particular operation to be executed on the CPU or GPU. All other executionspecific details like parallelization, CPU-to-GPU communication, and execution mechanisms are handled automatically by the TAMM infrastructure. While TAMM does not explicitly rely on OpenMP, it can leverage OpenMP for important operations like tensor transposition and GEMM via highly optimized external libraries such as HPTT and vendor BLAS. Setting the OMP_NUM_THREADS variable before running any TAMM-based application would enable OpenMP parallelization.

C. Comparison with other tensor algebra frameworks
In this section, we compare the functionality provided by TAMM with that provided by other distributed tensor algebra frameworks -TiledArray, CTF, DISTAL, and ExaTensor. Table I provides the key features and differences between these frameworks. While all these frameworks provide similar features in how tensors and tensor operations can be represented, TAMM is designed to be an extendable framework and hence differs from the remaining frameworks in the following key aspects discussed below: • Indexing and slicing: Indexing and slicing capabilities in any tensor algebra framework are crucial to defining tensor operations. Indexing plays a key role in constructing tensor operations (see Figure 5). Slicing is predominantly used in computational chemistry methods, allowing operations on isolated parts of tensors. Users can choose to write the tensor operations over slices of the tensors instead of having to copy the required portions first for such operations. While the tensor algebra frameworks being compared provide an indexing mechanism, they lack support for using slicing in the tensor operations. TA and CTF use a string-based solution for indexing but lack the capability to support sliced indexing. On the other hand, DISTAL and ExaTensor use object-based constructs to create and use labels in the computation. While DISTAL does not allow any slicing, ExaTensor let users to use pre-defined sub-ranges for slicing. TAMM provides generic indexing and slicing capabilities through various index space operations allowing users to define a subset of indices to create object-based labels (see Figure 3) as well as string-based labels to construct tensor operations. Users can make use of slicing capabilities by using object-based labels over sub-ranges or by creating new slices on-the-fly or use string-based labels to define operations over full tensors.
• Sparsity: Sparsity is widely used in several computational chemistry applications. All of the tensor algebra frameworks discussed in this section support some notion of sparsity for both data and computation. While DISTAL does not have direct support for sparsity, later work by the same authors, namely SpDISTAL, incorporates different sparse data structures as well as specifications to describe the storage and computation of sparse tensors. SpDISTAL supports a generic specification for describing the dense/compressed format for each individual mode as well as pre-defined sparse data representations (i.e., CSR, CSC etc.) for describing such tensors.
Operations over these tensors are executed by generating custom kernels. To contrast, more application-oriented tensor libraries, such as TA, CTF, and ExaTensor, allow block-level sparsity. TA allows users to define non-zero and zero tiles that construct the tensors, whereas CTF uses pre-defined symmetry-based sparsity over the tensors. TAMM, on the other hand, incorporates sparsity by providing attributes over the tiled index spaces. By default, TAMM implements spin symmetry-related attributes to allocate block sparse tensors depending on the values of corresponding attributes on the tiles of the index space. Internally, TAMM analyzes the tensor blocks to determine if the corresponding block is non-zero to optimize storage and computation on block sparse tensors.
• GPU Support: It is crucial to effectively leverage the available resources in modern heterogeneous HPC platforms. TAMM currently enables tensor algebra operations (mainly tensor contractions) on multiple GPU architectures (AMD, Nvidia, and Intel). At the time of this writing, the remaining frameworks are limited to Nvidia GPUs (except ExaTensor, which has AMD GPU support). The unified GPU infrastructure design allows TAMM to be easily extended to add support for a new hardware accelerator. Support for the upcoming GPU architectures with performance portability was also explored using SYCL, a domain-specific, heterogeneous programming language. Using SYCL and oneAPI Math Kernel Library (oneMKL) interfaces, 48 perturbative triples contribution from the CCSD(T) formalism was demonstrated with large-scale simulations on different GPU architectures. 49 • Extendable backends: TAMM currently uses Global Arrays (GAs), which is built on top of MPI, for managing all aspects of data distribution and communication. TAMM also has an experimental UPC++ 50 backend that can alternatively be used instead of GA. TiledArray(TA) uses the MADNESS 38 parallel runtime as the backend; DISTAL uses the Legion 51 programming model and runtime system for distributing the data and computing; CTF and ExaTensor use MPI as the underlying parallel programming model. TAMM is designed to be easily extendable by implementing the methods in the process group and memory manager classes to support other parallel programming models. To the best of our knowledge, the remaining frameworks are not easily extendable to add support for additional parallel programming models.
• Distribution: A TAMM tensor is allocated with a distribution type specified via the execution context. The distribution choices that are supported can be extended via the distribution class. DISTAL allows the specification of tensor distributions using a notation that allows the user to specify a process grid that the dimensions of a tensor can be mapped onto. Tensor distributions in TA and CTF are based on several parallel matrix multiplication algorithms (SUMMA, 2.5D, and 3D) 26 . TA additionally gives users the option of specifying a process map onto which tiles are distributed. However, CTF cannot be easily extended to allow other distributions.
• Scheduling and Execution: DISTAL allows users to specialize computation to the way that data are already laid out or easily transform data between distributed layouts to match the computation, allowing fully customizable execution strategies for each operation. Several parallel matrix multiplication algorithms from the literature 26 are expressible as schedules in DISTAL, which is currently not possible with any other framework in comparison. TA and CTF automatically execute tensor operations by dynamically redistributing data across operations if needed but do not provide any way to customize the operation scheduling process. TAMM allows deferring execution for a collection of tensor operations. When the user explicitly calls the execute function, the TAMM scheduler analyzes the dependencies between the various operations, and schedules them appropriately in order to load balance the computation and communication requirements of the overall execution across the collection of operations. However, TAMM does not provide any further customization to the scheduling process. The trade-off between these frameworks is that TAMM, TA, and CTF fully automate the distribution process, while users must explicitly provide a schedule to distribute their computations with DISTAL.
In this section, we have provided a detailed explanation of the TAMM framework. TAMM leverages a modular infrastructure to enable the implementation of various optimizations on different levels of computation, from data distribution to execution schemes on different hardware. This design allowed us to implement different memory managers, distribution schemes, and work distribution over different process groups without any major changes to the user-facing tensor algebra interface. Finally, we provided a feature-based comparison between TAMM and other distributed tensor algebra frameworks, namely TiledArray, CTF, DISTAL, ExaTensor.

IV. CC TAMM IMPLEMENTATIONS
In this section, we present case studies where TAMM was used to implement various scalable coupled-cluster (CC) methods for the latest HPC systems. While important, TAMM's primary contributions are not just the faster performing versions of these methods but also the ability to productively develop and explore new algorithms, and apply those improvements to all existing and new applications implemented using TAMM. This includes improvements in intra-node execution (single core, OpenMP multicore, GPUs, etc.), data distribution strategies (e.g., replication, process group-based distribution), and parallel execution (compute partitioning and communication scheduling algorithms). It also allows the concurrent development of optimized equations, parallel algorithms, and optimized intra-node kernels by different teams through clearly defined interfaces. The variety of methods presented and their performance are evidence that this approach accelerates the development and exploration of CC methods.

A. Canonical Methods
The canonical formulations of the Coupled-Cluster (CC) formalisms [1][2][3][4][5][6]8 are based on the exponential parametrization of the correlated ground-state wave function |Ψ⟩: where T is the cluster operator and the reference function |Φ⟩ in single-reference formulations is assumed to be represented by the Hartree−Fock Slater determinant. In practical realizations, it is assumed that |Φ⟩ provides a good approximation to the correlated ground state, |Ψ⟩. The cluster operator T can be partitioned into it many body components T k defined as where a † p (a p ) are the creation (annihilation) operators for an electron in p-th state and indices i 1 , i 2 , . . . (a 1 , a 2 , . . .) refer to occupied (unoccupied) spin orbitals in the reference function |Φ⟩. The operators T k , defined by the cluster amplitudes t i 1 ...i k a 1 ...a k , produce k-tuple excitations when acting onto the reference function.
To define the equations needed for determining cluster amplitudes, we introduce wave-function expansion (11) into the Schrödinger equation by pre-multiplying both sides by e −T . This procedure leads to an explicitly connected form of the energy-independent equations for amplitudes and energy, i.e., where h p q and v pq rs are tensors representing interactions in the quantum system and excited Slater determinants |Φ In all CC formulations discussed here, to form equations (14) for cluster amplitudes, one needs to (1) find and efficient way for distributing and compressing tensors h p q , v pq rs , and t

Coupled cluster singles doubles (CCSD)
The CCSD formalism (CC with single and double excitations) 4 is one of the most popular CC approximations and is used in routine computational chemistry calculations as a necessary intermediate step toward more accurate CC models, such as the perturbative CCSD(T) formalism discussed in Section IV A 2, or excited-state or linear-response CC extensions. In the CCSD formalism, the cluster operator T is approximated as and the equations for cluster amplitudes are represented as where the tensors r a i and r ab i j are commonly referred to as the residual vectors. Due to the large number of terms corresponding to complicated contractions between h p q /v pq rs and t i a /t i j ab , optimization of the expressions plays a crucial role. This is achieved by proper factorization by introducing the so-called recursive intermediates. For example, 1 4 v e f mn t i j e f t mn ab (21) term, which contributes to r ab i j in the naive approach is characterized by n 4 o n 4 u numerical overhead. However, by introducing the intermediate tensor I i j mn defined as (we assumed Einstein summation convention over repeated indices) the term (21) can be given by the equation at the total numerical cost proportional to n 4 o n 2 u . Equation 23 can be expressed using the TAMM interface as follows, (I(m, n, i, j ) = v(m, n, e, f ) * t(e, f , i, j )) (r(a, b, i, j ) + = 0.25 * I(m, n, i, j ) * t(a, b, m, n )) (24) The TAMM code in (24) encapsulates distributed execution of the tensor contractions on CPUs, GPUs as well as the choice of the parallel programming model used by TAMM for managing distributed tensor data. Similarly, all other CCSD equations can be represented as tensor operations using the TAMM interface. As indicated in (24), the total number of lines of code is greatly reduced in comparison to hand-coded implementations where one needs to explicitly manage data distribution using a particular programming model and also aspects of execution on various heterogeneous architectures. On the other hand, the TAMM representation of such equations makes it easy to develop and maintain scalable electronic structure methods (such as CCSD) that can be run on a variety of HPC platforms.
Another important technique that is useful in reducing the memory footprint of the CCSD approach is the Cholesky decomposition (CD) of the v pq rs tensor, i.e., v pq rs ≃ (pr|L)(L|qs) − (ps|L)(L|qr) , where L is an auxiliary index. The total memory required to store Cholesky vectors (pq|L) needed to reproduce v pq rs with high accuracy is usually proportional to (n o + n u ) 3 .
We developed both closed-shell and open-shell implementations of the CCSD equations using TAMM. The use of TAMM index spaces for alpha and beta-occupied and virtual orbitals allowed us to represent the same theory optimized for specific problems easily. An example equation that is a contribution to the intermediate I in Eqn. 24 may appear like the following: (I(m_a, n_a, i_a, j_a ) = v(m_a, n_a, e_b, f _b ) * t(e_b, f _b, i_a, j_a )) , (26) where the added _a and _b notations respectfully refer to the corresponding indices for the alpha and beta sub-spaces. While the equations for the closed-vs open-shell implementations are different, the ease of writing them is the same due to the slicing capabilities on sub-spaces provided by TAMM.
As benchmark systems for testing the performance of TAMM implementations of the Cholesky-based CCSD 52 , we used two molecular systems previously used in studying the efficiency of the TCE implementations of the CC methods in NWChem 53,54 . The first benchmark system is the model of Bacteriochlorophyll (BChl) MgC 36 N 4 O 6 H 38 , 55,56 which plays an important role in understanding the mechanism of photosynthesis. In this process, the light is harvested by antenna systems and further funneled to BChl molecules, which initiate primary charge separation. The second benchmark system considered here is the β -carotene molecule, whose doubly excited states and triplet electronic states have recently been intensively studied in the context of singlet fission processes in carotenoid aggregates. [57][58][59] The β -carotene molecule consists of 96 atoms, while the BChl model contains 85 atoms, including a central magnesium atom. We use the cc-pVDZ basis for both systems and evaluate their performance on OLCF Summit. 21 Time per CCSD iteration is given. Each Summit node contains two IBM POWER9 processors, each consisting of 22 cores and 512 GB of CPU memory. Each node is also equipped with six Nvidia Volta GPUs, each with 16 GB of memory, for a total of 96 GB of GPU memory per node. Table II shows the CCSD performance compared to NWChem on 200 nodes of OLCF Summit. We measure the performance of the TAMM implementations against the TCE implementations in NWChem. The time per CCSD iteration is given in seconds. NWChem has a CPU-only implementation and uses 42 CPU cores on each node of Summit. TAMMbased Cholesky-CCSD uses only six MPI ranks per node, where each MPI rank is mapped to a single GPU. For these two molecular systems, we observe a 9-15× speedup with the TAMM-based Cholesky-CCSD implementation compared to the TCE CCSD method in NWChem. The CPU implementation of the CCSD tensor operations in NWChem comprises  11,314 source lines of code whereas the Cholesky-CCSD implementation expressed using the TAMM framework is only 236 source lines of code. Since these 236 lines represent computation at a high level, they express both CPU and GPU computation. On the other hand, adding GPU capabilities to the NWChem CCSD code will only significantly increase the SLOC count and development time which is why a GPU implementation for CCSD has not been attempted to date in NWChem. This clearly demonstrates the productivity benefits of expressing such computations in TAMM. Our implementation of the CCSD equations 52 is expressed similarly to as the example in Figure 5. CCSD is an example of how TAMM can be used to productively create an effective and efficient implementation of certain classes of computational chemistry methods. The remaining computational chemistry methods discussed in this paper are also implemented along similar lines.

Coupled cluster triples
The CCSD(T) formalism 9,61 is capable in many cases of providing the so-called chemical level of accuracy required in studies of chemical reactivity and thermochemistry. In the CCSD(T) approach the perturbative correction due to triple excitations (E (T) ) is added to the CCSD energy: where where V N is the two-body part of the electronic Hamiltonian in a normal product form, and |Φ abc i jk ⟩ = a † a a † b a † c a k a j a i |Φ⟩. The most expensive part of the CCSD(T) calculation, characterized by the n 4 o n 3 u + n 3 o n 4 u scaling, is associated with calculating the ⟨Φ abc i jk |V N T 2 |Φ⟩ term, which is defined as ⟨Φ abc i jk |V N T 2 |Φ⟩ = v i j ma t mk bc − v i j mb t mk ac + v i j mc t mk ab − v ik ma t m j bc + v ik mb t m j ac + v ik mc t m j ab + v jk ma t mi bc − v jk mb t mi ac + v jk mc t mi ab − v ei ab t jk ec + v ei ac t jk eb − v ei bc t jk ea + v e j ab t ik ec − v e j ac t ik eb + v e j bc t ik ea − v ek ab t i j ec + v ek ac t i j eb − v ek bc t i j ea , (i < j < k, a < b < c). (29) Eq. (29) can be separated into terms A abc i jk and B abc i jk , defined by contractions over occupied indices (A abc i jk ; the first nine terms on the right hand side (r.h.s.) of Eq. (29)) and terms corresponding to contraction over unoccupied indices [B abc i jk ; remaining nine terms on the r.h.s. of Eq. (29)], i.e., Analogously, the ⟨Φ abc i jk |V N T 1 |Φ⟩ term takes the form While Eqs. (29) and (31) can be easily implemented with the tensor operations mentioned in Sec. II A, the performance of the (T) correction, in terms of both speed and memory, is improved by encoding key symmetries by hand while still using the distributed tensor data structure provided by TAMM. This includes restricted index summation and permutational symmetries, which are not routine features in the current version of TAMM. In addition, the contractions in (T) are all fused so that the resulting output is a scalar instead of a six-dimensional intermediate tensor. Currently, TAMM does not support fusion across contractions writing to the same output tensor, but this capability is considered for the near future. Furthermore, the hand-coded kernels for (T) for various GPU architectures are required for the best performance. In addition to fusion, we are also considering incorporating optimized GPU kernels in TAMM for specific contractions such as the ones in (T). The scheduler can then detect such contractions and choose the corresponding optimized kernels for execution rather than executing the contractions through the current default pipeline. Table III shows the TAMM-based NWChemEx triples correction (T) calculation 52 performance compared to NWChem on 512 nodes of OLCF Summit. Time is given in seconds. NWChem has a GPU implementation of the triples correction and uses all 6 GPUs and 42 CPU cores on each node. The TAMM-based triples correction uses only six MPI ranks per node, where each MPI rank is mapped to a single GPU. For the BChl and β -carotene molecules, we observe a speedup of ∼6-18× for the TAMM implementation compared to the TCE implementation in NWChem. The finer details of the TAMM-based triples correction implementation are detailed in Ref. 62.

Equation-of-motion coupled cluster formalism
The equation-of-motion (EOM) methods [63][64][65][66][67] and closely related linear response (LR) CC formulations [68][69][70] can be viewed as excited-state extensions of the single-reference CC theory. In the exact EOMCC formalism, the wave function for the K-th excited state |Ψ K ⟩ is represented as where R K is the excitation operator, which produces the Kth excited state when acting onto the correlated ground-state in CC representations. The energy of the K-th state and the amplitudes defining R K operator can be calculated by solving non-Hermitian eigenvalue problem where the similarity transformed HamiltonianH is defined as In the rudimentary EOMCCSD approximation (EOMCC with singles and doubles), the R K and T operators are approximated as Similar to the CCSD method, the numerical scaling of the EOM-CCSD approach is dominated by n 2 o n 4 u terms, where n o represents the number of occupied orbitals and n u represents the number of unoccupied orbitals. The major computational complexity in the EOMCC method arises from tensor contractions involving various operators. The method involves two types of excitation operators: the CC operator T and the EOM operator R, besides the one-body and two-body components of the Hamiltonian operator. The EOMCCSD approach is a non-Hermitian eigenvalue problem, and because of the dimensionality of the problems, they are solved iteratively using a multi-root solver. Therefore, there is an additional cost for constructing and working with the iterative subspace. In terms of implementation, however, the same TAMM format used for writing various tensor contraction equations, as demonstrated while discussing Equation 23, was followed for the EOMCC method.
The EOMCCSD method is usually employed in studies of excited states dominated by single excitations. It is also worth mentioning that LR-CC methods with singles and doubles leads to the same values of excited-state energies. However, in contrast to the EOMCCSD formalism, the LR-CCSD excitations are identified with the poles of frequency-dependent linear-response CC amplitudes. Table IV shows the TAMM EOMCCSD calculation performance compared to NWChem on 200 nodes of OLCF Summit. The time per EOMCCSD iteration is given in seconds. NWChem has CPU-only implementation and uses 42 CPU cores on each of the 200 nodes of Summit. Just as with the previous calculations, the TAMM-based EOMCCSD uses 6 MPI ranks per node, where each MPI rank is mapped to a single GPU. The current TAMM implementation of the EOMCCSD approach has not been optimized at the same equation level as the CD-CCSD implementation. For example, in contrast to the CD-CCSD formalism, the EOMCCSD implementation fully represents two-electron integrals and uses a non-spinexplicit representation of the operators in the equations. While we plan to return to this implementation and incorporate the optimized (spin-explicit) equations, we already observe a significant speed-up over NWChem with this primitive implementation. CD-CCSD and EOMCCSD calculations are different algorithms for solving the corresponding problem, and the timing for an iteration of EOMCCSD is a sum of all of the steps in an interaction, which tends to be more computationally demanding than the Jacobi step in CD-CCSD. Nonetheless, one can see from Table IV that the TAMM EOMCCSD code is 2-3 times faster for β -carotene and BChl molecules.

DLPNO CCSD(T)
The development of reduced-scaling quantum chemical methods became a significant part of the recent research effort. In particular, for coupled cluster approaches, the local domain-based methods have been introduced and successfully applied on large systems. Using even small computational clusters, the (DLPNO-CC) [14][15][16][17][18]71 formalism can be used for systems described by 10,000-40,000 basis set functions. The limit for system size can be further extended by utilizing parallel exascale architectures. However, to achieve this goal, several computational challenges associated with the data representation, distribution, and optimization of the DLPNO-CC equations (characterized by a large number of contractions and tensors involved in the underlying equations) have to be appropriately addressed. Our implementation is inspired by the work cited in Ref. 16. The basic idea is to carefully take advantage of the local character of correlational effects. The total CCSD energy can be seen as a sum of i j-pair specific contributions ε i j : In order to significantly reduce scaling, we are considering the following tasks for the NWChemEx DLPNO CCSD(T) implementation using TAMM: 1. Differentiate pairs with respect to their energy contributions ε i j by sequential pre-screenings. Identify i j-pairs that a) are negligible and immediately dropped, b) can be evaluated at a lower-level model (MP2 level), and c) can be evaluated at CC level.
2. Find an optimal representation of the virtual space for each i j-pair, in which the derived tensors (amplitudes, residuals, or integrals) are dense.
3. Find the optimal factorization of terms in amplitude equations.
4. Perform only those tensor contractions that lead to nonzero results (including integral transformation and amplitude equations evaluation).
In the local pair natural orbital CCSD method, the occupied orbitals are localized (for example, by the Pipek-Mezey or Foster-Boys algorithms 72,73 ), and the virtual space is constructed specifically for a given occupied orbital pair. First, the virtual orbitals are transformed to a local basis of projected atomic orbitals (PAO) |μ⟩ as |μ⟩ = (1 − ∑ i |i⟩⟨i|)|µ⟩, [74][75][76] , which are a priori local and can be easily used to form domains corresponding to local occupied orbitals. In the next step, pair natural orbitals (PNOs) are constructed. 15,16,[77][78][79][80] Therefore, in Eqs. 19 and 20, instead of virtual index a, b, . . . we will get pair-specific PNOsã i j ,b i j , . . . where we assume that the size of PNO space N(PNO(i j)) ≪ N(MO virt ). The PNO spaces are obtained from pair density matrices D i j where T i j are MP2 amplitudes in PAO basis where f ii and f j j are occupied orbital energies, and εμ or εν are PAO orbital energies obtained by diagonalization of the Fock matrix transformed to PAO space. The transformation matrices d i j transforming orbitals from PAO space to PNO space, and occupation numbers n i j correspond to eigenvectors and eigenvalues of the diagonalized matrix D i j , The introduction of i j-specific PNO spaces, which are mutually non-orthogonal, leads to more complicated expressions in the amplitude equations because we also need to involve overlap matrices between different PNO spaces. The matrix transforming i j-PNO space to kl-PNO space S i j;kl a i jbkl is defined as where S PAO is the PAO overlap matrix. Therefore, when considering the PNO space, the term in Eq. 21 can take the form However, it is also possible to transform the e, f indices from i jto mn-PNO space first and then to perform the contraction of the integral with the first amplitude. The complex structure of these terms significantly expands the possibilities of how they can be factorized. The cost is also affected by the integral evaluation, which depends on the available memory. We can pre-compute not only vẽ i jfi j i j type integrals, but also some mixed PNO space integrals (only those which that be needed).
In our work, we assume that we will utilize a larger number of nodes, so we will have more memory and computing effort available. In that case, we can afford tighter thresholds, leading to larger domains and PNO spaces, which means higher precision in the recovery of the correlation energy. For the implementation of DLPNO formulations in TAMM, we employed a code generator implemented in Python that transforms the canonical equations by the transformation rules for various spaces (i.e., PNO, PAO, etc.). Using a code generator allowed us to systematically convert equations while automatically applying an operation cost minimization algorithm for single-term optimization. We anticipate that these kinds of code generators on the high-level equations will enable trying different transformations and automatically choosing the best performing one. While this implementation is in the early stages, we were able to validate the correctness of the generated code using our infrastructure to directly compare the results with their canonical counterparts.
The perturbative correction (T) described in Sec. 4.1.2 is in the local version evaluated using the same equations with some differences. 81 While converged T 1 and T 2 amplitudes are obtained in their PNO space, the virtual indices in terms A abc i jk or B abc i jk are represented in triples natural orbitals (TNOs), which are computed from triplet density matrices obtained from three pair density matrices, D i jk = 1/3(D i j + D ik + D jk ). In order to perform contractions of integrals and amplitudes in Eqs. 29 and 31, we need to involve transformation matrices between PNO and TNO spaces S i j;klm , which are computed the same way as in Eq. 41 where d kl is replaced by d klm , transforming PAO orbitals to TNO space. Similar to the DLPNO CCSD implementation, we leveraged the TAMM framework's dense tensor infrastructure to represent the perturbation correction implementation with block-sparse computation. Using the PNO representation of the amplitudes from the CCSD implementation, we implemented the DLPNO formulation of the canonical equations.
For the development of DLPNO-based methods in NWChemEx, we utilized specialized view tensor construction to avoid redundant storage of the transformation tensors,  Figure 6 shows an example use case specific to DLPNO CCSD equations. In this example, we are using a view tensor over the large Sijkl transformation tensors. Depending on the pair indices used within the equations, various kinds of transformations are required. In this example, the constructed Sijki tensor (line 17) has values from Sijkl if the first index of the first pair matches the second index of the second pair. If not, the corresponding indices are set to zero. In this case, there was no need to translate the blocks or update the tensor since the tensor was used as read-only. However, users can provide special constraints on both the translation of the blocks and access to these blocks. By using this feature, we were able to reduce the memory footprint of DLPNO methods and add additional constraints to leverage sparsity within the computation. Since TAMM allows mixed usage of specialized tensors with other tensor types in all supported operations, the representation and execution patterns of the DLPNO CCSD equations in TAMM did not have to change in such scenarios.

Time-dependent coupled-cluster method
In addition to the stationary and frequency-dependent formulations of CC theory, one could witness significant progress on our end in developing a time-dependent CC method for simulating real-time dynamics.
The TDCC method has been studied in the context of the CC linear-response theory, 68,69,[82][83][84] x-ray spectroscopy and Green's function theory, [85][86][87][88][89] nuclear physics, 90-92 condensed matter physics, 93 and quantum dynamics of molecular systems in external fields. [94][95][96][97][98][99] These studies have also initiated an intensive effort toward understanding many aspects of the TDCC formalism, including addressing fundamental problems such as the form of the action functional, the form of the time-dependent molecular orbital basis, the physical interpretation of time-dependent orbitals, various-rank approximations of the cluster operator, and the numerical stability of time integration algorithms. One of the milestone achievements in developing a time-dependent CC formalism was Arponen's action functional for the bi-variational coupled cluster formalism 93 and the following analysis by Kvaal considering time-varying orbitals. 95 The time-dependent Schödinger equation (TDSE), is an initial value problem, which has a formal solution |Ψ(τ)⟩ = e −iHτ |Ψ⟩, with the initial condition, |Ψ(0)⟩ = |Ψ⟩. An efficient way of finding approximate solutions of the TDSE for many-electron systems is associated with the utilization of the time-dependent CC (TD-CC) Ansatz Upon plugging TD-CC Ansatz into Eqn 43, and projecting onto the excited determinants, one obtains where T (τ) is a time-dependent cluster operator, |Φ⟩ is a time-independent reference wave function in our consideration, and H(τ) = e −T (τ) He T (τ) , is the similarity transformed timedependent Hamiltonian, which is the generator for the time evolution.
The similarity transformed CC Hamiltonian is non-Hermitian. Therefore, the computation of observables in this framework requires a bi-variational approach, 65,93 i.e., both the bra and ket states must be varied independently. The expectation value of any observable in the TDCC framework is defined as where Λ is a de-excitation operator. For obtaining equations for the Λ, we write down the TDSE, −i∂ τ ⟨Ψ ′ (τ)| = ⟨Ψ ′ (τ)|H, for the bra vector, with H = H † , Multiplying by e T (τ) from the right, and right projecting to the excited determinants obtains The right-hand side of Eqn. 45, is the same as the stationary CC amplitude equations and hence implemented using TAMM, similar to how it is implemented for the stationary theory as shown in Sec. IV A 1. However, the time-dependent CC amplitudes are naturally complex since the underlying equation we are solving is the TDSE. Furthermore, one-body and two-body Hamiltonian matrix elements are real-valued quantities. TAMM support for mixed complex-real tensor operations helps avoid the explicit handling of the real and imaginary data 1 // src , dest tensors are given can be expressed exactly the same way as in the stationary CC theory as shown in Eqn 24. However, in this case, R and t are complex-valued, while v is a real-valued quantity.
We have propagated the time-dependent CC amplitudes using a first-order Adams-Moulton method, which is an implicit time-propagator. This algorithm requires copying and swapping of the real and complex parts of time-dependent amplitudes in each micro-iterations. TAMM does not directly support swapping the real and imaginary parts of each element of a complex tensor. However, in this case, we used the block_ f or construct of TAMM to perform the swap in parallel. A specialized C++ lambda expression, as shown in Figure 7, is provided to the block_ f or, which simply specifies that the real and imaginary parts of each element in a given block need to be swapped and copied to (or updated) the corresponding element in the destination tensor block.
The TDCC method performs time propagation for a sufficiently long duration to obtain well-resolved spectra. However, obtaining computing time for such an extended period in a single run is unlikely when using shared computing resources. In addition, it is highly beneficial to track simulations to adjust and optimize various parameters in between different runs of the same simulation. The ability to checkpoint and restart simulations, hence, plays a significant role in the efficient utilization of computing resources. The TAMM library provides parallel file I/O operations on tensors, which helps with checkpointing and restarting the TDCC calculations. We simply use the TAMM interface to write and read tensor data to and from the disk, in as shown in Fig. 8. The underlying TAMM infrastructure handles the parallelization aspects of these I/O operations.
The TAMM infrastructure has been utilized to implement the time evolution of the T (τ) operator. We have considered singles (S) and doubles (D) excitation approximations in our implementation. Our implementation uses Cholesky-decomposed 1 tamm:: write_to_disk ( tensor , tensor_filename ) ; 2 tamm::read_from_disk( tensor , tensor_filename ) ; two-body electron repulsion matrix elements, 100,101 and we exploit spin-explicit formalism to evaluate tensor contractions of matrix elements of various operators.

Coupled cluster Green's function
The correlated formulations of Green's function methods are indispensable elements of the computational infrastructure needed not only to calculate ionization potentials, electron affinities, and excitation energies but also as quantum solvers for various embedding formulations. The CC formalism provides a natural platform for the development of the one-body Green's function and the introduction of high-rank correlation effects. [102][103][104][105] Without loss of generality, here we will focus only on the retarded part of Green's function (the advanced part can be developed in an analogous way) defined by matrix elements G R pq (ω): where E 0 is the corresponding ground-state energy for the Nelectron system, η is a broadening factor, and |Ψ g ⟩ represents the ground-state of the N-electron system. In CC we are using different parametrizations for the bra (⟨Ψ g |) and ket (|Ψ g ⟩) wave functions, 65,93 i.e., which leads to the following form of retarded part of the CC Green's function (CCGF) 102-107 The similarity transformed operators here A (A = H, a p , a † q ) are defined as A = e −T A e T (the H N stands for a normal product form of H). The numerically feasible algorithms for calculating (53) employ ω-dependent auxiliary operators X p (ω) X p (ω) = X p,1 (ω) + X p,2 (ω) + . . .
that satisfy equations Using these operators matrix elements can be expressed in a simple form In our implementation of CCGF formalism in the SPEC project, we approximate Λ operator by T † . The main numerical effort associated with constructing a retarded CC Green's function is associated with the need to solve a large number of independent linear equations, which in turn can contribute to efficient parallel schemes utilizing multiple levels of parallelism. Our CCGF algorithm consists of an outer loop that checks for the convergence of the entire calculation. We refer to iterations of this loop as levels. Each level mainly consists of two loops which are the most computationally intensive part of the entire calculation. The first loop goes over the frequencies (ω ′ s) sampled using the adaptive midpoint refinement strategy. 108 The second loop goes over all the orbitals (p ′ s) and the CCGF singles and doubles equations are solved for all (ω, p) pairs for a given level in this loop. Since the bulk of the computation here is associated with high-dimensional tensor contractions, we express the tensor operations similar to the cases in Fig. 5 and Eqn. 24, even though the tensor operations in CCGF involve a mix of complex and real data types. This demonstrates the uniform operation representation provided by TAMM for operations on mixed data types. We refer to the list of all (ω, p) pairs in a given level as CCGF tasks each of which can be executed independently. Each task solves the CCGF singles and doubles equations until convergence for a given (ω, p) pair. All the p orbitals for a given ω are divided across process groups that are set up at the beginning of each level. Currently, the user is responsible for explicitly creating MPI process groups and using them to construct the TAMM process group and execution context objects which are in turn used for executing each task. Each TAMM process group now contains a subset of the resources (processes) provided to the CCGF calculation. In our CCGF implementation, the size of a process group for computing each task is determined automatically for a given problem size (i.e., number of occupied and virtual orbitals) and the resources (total number of nodes, processes, and GPUs per node) provided for that run. All process groups are created to be the same size. We are actively working on providing abstractions in TAMM for expressing computation to be divided across process groups without the user having to create them manually using MPI. Once the execution context objects representing different process groups are created, all CCGF tasks are distributed across the available process groups and executed using different execution contexts.
The resulting tensors from each task are stored on disk using the parallel file I/O routines provided by TAMM for writing and reading distributed tensors to and from the disk, as shown in Figure 8. This enables restarting the CCGF calculation at any point in the calculation. If the set of tasks in a given level is not completed, the subsequent CCGF run upon restart will create process groups only for the remaining tasks and execute them. Another feature provided by TAMM used in CCGF is batched parallel file I/O operations. Several tensors for each task are written to disk as part of the CCGF calculation. The number of tensors written to disk grows with increasing CCGF problem size and the number of frequencies (ω) to be solved. Even though each tensor is read/written using parallel file I/O, the operation uses a small subset of available nodes/processes 1 tamm:: write_to_disk ( tensor_list , tensor_filenames_list ) ; 2 tamm::read_from_disk( tensor_list , tensor_filenames_list ) ; Figure 9: Batched parallel file I/O operations in TAMM to perform the read/write operation for a single tensor. This is because the entire set of available resources used for the CCGF calculation is not required to read/write even large tensors from/to disk. This leads to resource under-utilization since most of the resources are idle while a subset of them are performing the parallel file I/O operation on one distributed tensor at a time. To address this issue, TAMM provides high-level batched tensor I/O routines that can read several hundred tensors from disk concurrently by automatically dividing the total available resources into smaller process groups. Since all tensors that need to be read/written are not necessarily of the same size, variable-sized process groups are dynamically created based on the sizes of each tensor in a given list of tensors. Each of these process groups reads/writes a tensor using parallel file I/O within the group. Several tensors are read/written concurrently by different process groups, leading to significantly better resource utilization and overall improvement in the total time spent in file IO. The user would express such an operation in TAMM as shown in Figure 9. The operation is expressed exactly the same way as one would express it for reading or writing a single tensor (e.g., Figure 8), with the only obvious difference being that a list of tensor handles and corresponding filenames need to be provided in this case. Reference 109 describes the highly scalable CCGF implementation developed using TAMM in detail, along with the performance and scalability analysis on OLCF Summit. The implementation is also publicly available. 110 This section provides an overview of some important CC methods and discusses key aspects of their implementations using TAMM. While frameworks such as TiledArray, CTF, DISTAL,and ExaTensor provide similar features in how tensor operations in CC methods can be represented, TAMM differs from them as explained in Section III C. To the best of our knowledge, these are the first complete scalable distributed implementations of the discussed CC methods that can be run on different GPU architectures while also using different parallel programming models.

V. PERFORMANCE COMPARISON WITH OTHER TENSOR ALGEBRA FRAMEWORKS
We have provided feature comparisons with other distributed tensor algebra frameworks in Section III C, where key library features such as tensor-construction and tensor-operation specification, hardware support, and the underlying parallel programming model backends (for distributed data management) are compared. This section details the performance comparison results with other distributed tensor algebra frameworks, namely, TiledArray (TA) and Cyclops Tensor framework (CTF). We did not consider comparisons with ExaTensor since the development page 111 states that the library has pending performance issues as well as numerous problems with existing MPI-3 implementations at the time of this writing.
All our experiments were performed on the National Energy Research Scientific Computing Center (NERSC) Perlmutter supercomputer. The GPU partition was used for all the experiments. Each node in this partition has a 64-core AMD EPYC 7763 CPU, 256 GB of DDR4 DRAM, and 4 Nvidia A100 GPUs, each with 40 GB of HBM2e RAM. Each GPU node is connected to four NICs, and the GPU nodes are connected using the Slingshot 11 interconnect. The TAMM experiments were configured to run with four MPI ranks per node and one MPI rank per GPU. The TiledArray experiments were configured with four MPI ranks per node (one MPI rank per GPU) and two threads per rank since that is the TiledArray developer's recommended configuration for Perlmutter. The CTF implementation provided the best performance when using either 8 or 12 MPI ranks per node (2 or 3 MPI ranks per GPU), depending on the problem size and the number of nodes used in the experiment. For a given dimension size (N) and node count, we ran CTF using both 8 and 12 MPI ranks per node and chose the best timing out of the two. All codes were compiled with GCC 11.2, CUDA 11.7, and cray-mpich 8.1.
We compared the performance of a tensor contraction shown in Equation 57, which is one of the most expensive operations in many CC calculations. For benchmarking purposes, each dimension of the 4D tensors in Eqn. 57 is chosen to be of the same size (N) resulting in a total size of (N 4 ) doubleprecision floating-point elements for each tensor. The input tensors are filled with random data in each case. We used the readily available TA and CTF codes 36? for benchmarking equation 57 and ensured that the ordering of the indices used in the contraction is consistent across the different benchmark codes. The TA benchmark allows specifying the number of blocks for each dimension. For a given dimension size (N) and node count, we ran the TA benchmark several times using varying number of blocks along each dimension in each run and chose the timings for the best performing case for comparing against TAMM.
C(a, b, c, d)+ = A(a, b, m, n) * B(m, n, c, d) The performance results for TAMM, TiledArray, and CTF implementations of equation 57 are shown in Figure 10. Plots for N = 400, 500, 600, and 800 are also shown on different Perlmutter node counts. With an increasing number of nodes in order to strong scale for a given problem size (N), TAMM demonstrated consistent performance improvements. In contrast, TA and CTF exhibited some oscillating behavior. The TAMM implementation strong scales consistently for a given problem size (N), and provides competitive performance compared to TA and CTF, especially for large problem sizes and node counts. Table V shows the minimum number of nodes required on Perlmutter to run each of the three implementations for the tensor contraction shown in (57) for large problem sizes (N = 600 and N = 800). As N increased, the required minimum number for running the benchmark in the cases of TA and CTF  We also compare the performance of the contraction in Equation 57 with COSMA 112 by representing the contraction as a distributed matrix-multiply operation. COSMA is a state-ofthe-art distributed matrix-multiply library that is capable of running on GPUs. Equation 58 shows the equivalent of Equation 57 represented as a distributed matrix-multiply operation in COSMA, We use the COSMA provided distributed matrix-multiply miniapp. 113 The size of each dimension for the COSMA benchmark was chosen as N 2 in order to match the overall sizes for each tensor (i.e. (N 4 )) in Equation 57. Table VI shows the values for N considered in the comparison. At the time of this writing, we observe that COSMA, like the TiledArray and CTF implementations, also requires a certain minimum number of nodes, which is significantly higher than what is required by the corresponding TAMM implementation for a given problem size (N). We observe that the TAMM performance is nearly identical to COSMA for N = 400 and about 10% better than COSMA for N = 500. For larger problem sizes N = 600 and N = 800, the COSMA implementation ran out of memory on 300 and 400 nodes, respectively. The DISTAL 39 work also provides a performance comparison with COSMA. The DIS-TAL authors mention that COSMA generally achieved about 10% higher performance in comparison to DISTAL using up to 256 nodes of the Lassen supercomputer. Since the code for DISTAL is not in the public domain at the time of this writing, a direct performance comparison with DISTAL is currently not possible. CCSD benchmark: We implemented our Cholesky-CCSD equations as tensor expressions in TA and CTF to further investigate the performance on a real application. We ran calculations of Ubiquitin-DGRTL ? ? molecule with the aug-cc-pVDZ basis (O=146, V=1096, 7810 Cholesky vectors) on up to 350 nodes of Perlmutter using the best performance parameters for TA and CTF. With TAMM, for a single CCSD iteration on 240 nodes, we observed an 80% speed-up over CTF while TA ran out of memory. On 350 nodes, we observed nearly identical timing with CTF and 2.5x speed-up over TA.

VI. CONCLUSION
We have introduced and discussed the Tensor Algebra for Many-body Methods framework that enables scalable and performance-portable implementations of important computational chemistry methods on modern large-scale heterogeneous high-performance computing systems. We described the TAMM framework in detail by introducing a tensor algebra interface that provides a high-level representation of the tensor algebra operations as an embedded domain-specific language. This interface enables the separation of concerns between scientific application development and high-performance computing development efforts. The domain scientist or scientific application developer can focus on the method development instead of the performance optimization details, whereas the HPC developers focus on the underlying algorithms and optimizations. Later, we presented our modular infrastructure that allows the implementation of different optimizations on tensor data distribution, execution, and scheduling of tensor operations for efficient execution on modern heterogeneous HPC platforms. We also compared the features of TAMM against several other distributed tensor algebra frameworks. Through various case studies, we showcased the performance and productivity benefits of using the TAMM framework for implementing complete ground-and excited-state electronic structure methods that are expressed as operations on tensors. Finally, we evaluated the performance of TAMM against other distributed tensor algebra frameworks and demonstrated it's competitiveness and scalability on large problem sizes and node counts on NERSC Perlmutter.