Integral-direct and parallel implementation of the CCSD(T) method: algorithmic developments and large-scale applications.

A completely integral-direct coupled-cluster singles, doubles, and perturbative triples [CCSD(T)] implementation, economical in both disk I/O and network traffic, has been developed relying on the density-fitting approximation. By fully exploiting permutational symmetry, the presented algorithm is highly operation-count and memory efficient. Our measurements demonstrate excellent strong scaling achieved via hybrid MPI/OpenMP parallelization and a highly competitive 60-70% utilization of the theoretical peak performance on up to hundreds of cores. The terms whose evaluation time becomes significant only for small- to medium-sized examples have also been extensively optimized. Consequently, high performance is also expected for systems appearing in extensive data sets used, e.g., for density functional or machine learning parametrizations, and in calculations required for certain reduced-cost or local approximations of CCSD(T), such as in our local natural orbital scheme [LNO-CCSD(T)]. The efficiency of this implementation allowed us to perform some of the largest CCSD(T) calculations ever presented, for systems of 31-43 atoms and 1037-1569 orbitals, using only 4-8 many-core CPUs and 1-3 days of wall time. The resulting 13 correlation energies and the 12 corresponding reaction energies and barrier heights are added to our previous benchmark set collecting reference CCSD(T) results of molecules at the applicability limit of current implementations.

We report a CCSD(T) algorithm and implementation designed for the above three goals.
For efficient execution also on computer clusters or in cloud computing environments where local disks are unavailable or relatively small, the algorithm is completely integral-direct, utilizing an effective DF-based and batched approach for the evaluation of $t_1$-transformed four-center integrals. 28,76 At the same time, in preparation for moderate per-node memory, only the absolutely necessary four-dimensional tensors (the doubles CC amplitudes and residuals) are kept during the CCSD iteration. A low per-node memory footprint is also ensured by a shared-memory intra-node (OpenMP) parallelization strategy and symmetry-packed array formats. Consequently, disk input/output (I/O) and network use are completely avoided within a CCSD iteration and the evaluation of the (T) correction. In order to minimize the operation count, the highest possible permutational symmetry is exploited throughout, and an additional layer of inter-node (MPI) parallelization is implemented on top of OpenMP. We have found that, for a significant portion of the target use cases, when the virtual/occupied ($n_v/n_o$) ratio is in the range of 5-10, the usual assumption that the $O(n_v^4 n_o^2/4)$-scaling particle-particle ladder (PPL) term dominates the cost of CCSD does not hold. In such cases the remaining $O(4 n_v^3 n_o^3)$-scaling terms also take comparable time, and relatively limited attention has been devoted to them in the literature. Such cases occur with relatively small, e.g., double-ζ quality basis sets, with small molecules of only a few hundred MOs, or when the virtual basis is successfully compressed via NO approximations, such as in our LNO-CC scheme. 21,57 Thus, we also developed hand-optimized, memory-efficient, in-core, and well-parallelized algorithms for all terms appearing in the $t_1$-transformed CCSD equations. 28,76 Our recent OpenMP-parallel (T) algorithm 56 was also updated to match the minimum memory demand of the new CCSD code, while making it MPI/OpenMP parallel and free from disk I/O and network use.
Some of our algorithmic choices are inspired by the $t_1$-transformed DF-CCSD implementation of DePrince and Sherrill 28 with notable improvements regarding the MPI parallelism, optimized non-PPL terms in CCSD, a more than three times smaller minimum memory requirement, and the rigorous avoidance of disk I/O during a CCSD iteration step. The most recent CCSD(T) programs of the MPQC 19 and FHI-aims 20 packages are also similar to ours in some respects, such as the DF-based (semi-)integral-direct execution and high parallel efficiency. The main deviation in our algorithm design is that here the full permutational symmetry is retained for all terms of the CCSD equations, while refs 19 and 20 employ (partially) redundant solutions for some of the contractions and data structures appearing, for instance, in the PPL term. Since the packing/unpacking operations required for the efficient matrix-matrix multiplication-based formulation of the most demanding contractions do not scale well with the number of cores, the solutions of Peng et al. 19 and Shen et al. 20 are expected to perform better with thousands of cores. Since massive parallelism is already provided by the completely independent evaluation of numerous moderate-sized CCSD(T) problems in the second and third target applications, and we were able to perform large-scale CCSD(T) calculations in the 1000-1500 orbital region with a few hundred cores, the presented algorithm matches our goals well. It is also worth noting that, while the scheme of Scheffler and co-workers 20 and ours employ a hybrid MPI/OpenMP strategy and compute the (T) correction via the "ijkabc" approach, 77,78 Valeev and co-workers 19 implemented a purely MPI-based framework with extensions to graphics processing units and utilization of many-integrated-core architectures, and evaluate an "abcijk"-type (T) expression. 78,79 Additionally, refs 19 and 20 rely on a distributed-memory model suitable for massive parallelism, whereas our low-memory, batched, and fully integral-direct algorithms provide more bandwidth-economic replicated-memory solutions for systems of up to about 2000 orbitals.
After providing the theoretical background in Section 2 and the detailed introduction of the novel algorithms in Section 3, extensive benchmarks are presented for both intra-node and inter-node parallel scaling in Section 5. The measurements, representing typical use cases, e.g., those appearing within an LNO-CCSD(T) calculation, demonstrate excellent strong scaling comparable to that of state-of-the-art implementations, 19,20,28,80 while close-to-ideal absolute efficiency is also displayed in terms of peak performance utilization. Section 6 presents illustrative applications at the applicability limit of current CCSD(T) codes for three reactions taken from recent mechanistic studies. 81-83 The resulting 13 correlation energies and 12 reaction energies and barrier heights are added to our recent 26-item correlation energies of medium-sized systems (CEMS26) compilation 21 and will be employed for the accuracy assessment of CCSD(T) approximations.

CCSD energy and amplitude equations
Assuming a closed-shell reference determinant consisting of spatial orbitals, the CCSD energy can be written as

$$E_{\mathrm{CCSD}} = E_{\mathrm{HF}} + 2\sum_{ai} f_{ai}\, t^a_i + \sum_{aibj} \left[ 2(ai|bj) - (aj|bi) \right] \left( t^{ab}_{ij} + t^a_i t^b_j \right),$$

where indices $i, j, \ldots$ ($a, b, \ldots$) denote occupied (virtual) orbitals, $t^a_i$ and $t^{ab}_{ij}$ represent the singles and doubles amplitudes, respectively, $f_{pq}$ are elements of the Fock matrix with general orbital indices $p, q, \ldots$, and $(pq|rs)$ is a two-electron repulsion integral in the Mulliken convention.
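As an illustration, the closed-shell correlation energy above can be evaluated from dense amplitude and integral arrays with a few contractions. The following sketch is not taken from the implementation described here; the array names, shapes, and the occupied-first index ordering are our own assumptions, and the $f_{ai} t^a_i$ term is omitted assuming a canonical HF reference.

import numpy as np

def ccsd_energy(g_ovov, t1, t2):
    """Closed-shell CCSD correlation energy for a canonical HF reference.

    g_ovov[i,a,j,b] = (ia|jb) in Mulliken convention,
    t1[i,a] and t2[i,j,a,b] are the singles and doubles amplitudes.
    """
    # tau collects the doubles plus the disconnected singles product
    tau = t2 + np.einsum('ia,jb->ijab', t1, t1)
    # contract 2 (ia|jb) - (ib|ja) with tau
    return np.einsum('iajb,ijab->', 2.0 * g_ovov - g_ovov.transpose(0, 3, 2, 1), tau)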
The equations determining the excitation amplitudes are derived by projecting the Schrödinger equation on the space of excited determinants, e.g.,

$$0 = \langle \Phi^{ab}_{ij} | e^{-\hat{T}_1 - \hat{T}_2}\, \hat{H}\, e^{\hat{T}_1 + \hat{T}_2} | \Phi_0 \rangle .$$

Here, $\hat{T}_1$ and $\hat{T}_2$ are the cluster operators corresponding to single and double excitations, $\Phi_0$ and $\Phi^{ab}_{ij}$ denote the Hartree-Fock (HF) and a doubly excited determinant, and $\hat{H}$ is the Hamiltonian operator of the system.
The four-center ERIs, collected in matrix $\mathbf{K}$ with elements $K_{pq,rs} = (pq|rs)$, will be evaluated relying on the DF approximation 84-86 as

$$\mathbf{K} \approx \mathbf{I}\, \mathbf{V}^{-1}\, \mathbf{I}^{\mathrm{T}},$$

where $\mathbf{I}$ collects the $I_{pq,Q} = (pq|Q)$ three-center two-electron integrals with $Q$ indexing the functions of the auxiliary basis. The two-center two-electron integral matrix of the auxiliary basis functions, with elements $V_{P,Q} = (P|Q)$, is factorized using Cholesky decomposition as $\mathbf{V} = \mathbf{L}\mathbf{L}^{\mathrm{T}}$, where $\mathbf{L}$ is a lower triangular matrix. After the decomposition the inverse of $\mathbf{V}$ can be recast as $\mathbf{V}^{-1} = (\mathbf{L}^{-1})^{\mathrm{T}} \mathbf{L}^{-1}$, and the inverse of $\mathbf{L}$ can be contracted with $\mathbf{I}$, forming $\mathbf{J}$ as

$$\mathbf{J} = \mathbf{I}\, (\mathbf{L}^{-1})^{\mathrm{T}}, \qquad \mathbf{K} \approx \mathbf{J}\mathbf{J}^{\mathrm{T}}.$$

The CCSD amplitude equations can be greatly simplified by absorbing the effect of the single excitations into the Hamiltonian 87 via defining

$$\hat{H} = e^{-\hat{T}_1}\, \hat{H}\, e^{\hat{T}_1}.$$

The above $t_1$-transformed Hamiltonian is expressed by the elements of the corresponding transformed Fockian ($\hat{f}_{pq}$) and the elements of the transformed three-center ERIs ($\hat{J}^Q_{pq}$), for which eqs 29-31 of the Appendix collect the explicit expressions.
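The construction of the DF factors amounts to a few linear-algebra steps. The sketch below is a minimal illustration with dense numpy arrays and our own variable names; a production code would, of course, compute and contract the three-center integrals in batches.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def build_df_factors(I, V):
    """Form J such that K = I V^{-1} I^T is approximated by J J^T.

    I: (n_pq, n_aux) three-center integrals (pq|Q)
    V: (n_aux, n_aux) two-center integrals (P|Q)
    """
    L = cholesky(V, lower=True)           # V = L L^T
    # J = I (L^{-1})^T via a triangular solve instead of an explicit
    # inverse: solve L J^T = I^T for J^T
    Jt = solve_triangular(L, I.T, lower=True)
    return Jt.T                           # K is approximated by J @ J.T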
Substituting the $t_1$-transformed Hamiltonian of eq 6 into the CC equations, the singles ($R^a_i$) and doubles ($R^{ab}_{ij}$) residuals can be written as in refs 28 and 87; the corresponding working equations introduce the shorthand intermediates ($A^{ab}_{ij}$, $B^{ab}_{ij}$, etc.) referenced in Section 3.

The perturbative (T) correlation energy
The closed-shell (T) energy expression in the canonical orbital basis can be written as 8,79

$$E_{(\mathrm{T})} = \frac{1}{3} \sum_{ijk} \sum_{abc} \frac{\left( 4 W^{abc}_{ijk} + W^{bca}_{ijk} + W^{cab}_{ijk} \right) \left( V^{abc}_{ijk} - V^{cba}_{ijk} \right)}{D^{abc}_{ijk}},$$

where

$$W^{abc}_{ijk} = P^{abc}_{ijk} \left[ \sum_d (bd|ai)\, t^{cd}_{kj} - \sum_l (ck|jl)\, t^{ab}_{il} \right], \qquad V^{abc}_{ijk} = W^{abc}_{ijk} + (bj|ck)\, t^a_i + (ai|ck)\, t^b_j + (ai|bj)\, t^c_k,$$

and $D^{abc}_{ijk} = f_{ii} + f_{jj} + f_{kk} - f_{aa} - f_{bb} - f_{cc}$. Here, the permutation operator $P^{abc}_{ijk}$ is introduced as

$$P^{abc}_{ijk}\, g\!\begin{pmatrix} abc \\ ijk \end{pmatrix} = g\!\begin{pmatrix} abc \\ ijk \end{pmatrix} + g\!\begin{pmatrix} acb \\ ikj \end{pmatrix} + g\!\begin{pmatrix} cab \\ kij \end{pmatrix} + g\!\begin{pmatrix} cba \\ kji \end{pmatrix} + g\!\begin{pmatrix} bca \\ jki \end{pmatrix} + g\!\begin{pmatrix} bac \\ jik \end{pmatrix}.$$

Let us note that eqs 21 and 22 do not depend on the transformed $\hat{f}$ and $\hat{J}$ since the evaluation of the (T) correction cannot be accelerated by the $t_1$-transformation.
The sixfold permutational symmetry of the intermediate quantities can be utilized in the evaluation of the perturbative triples energy expression. When working in the canonical orbital basis of a closed-shell system, two suitable energy expressions can be derived 77,78 to this end.
Here, we limit our discussion to the case when the outermost loops run over the restricted occupied indices. For the explicit energy formula corresponding to this so-called "ijkabc" algorithm, we refer to eq 5 of ref 56.

Algorithm
This section presents the algorithmic developments and parallelization efforts devoted to the CCSD iteration and then to the perturbative triples correction. In accordance with our target application types, the following priorities are set for the new DF-CCSD(T) code: minimal memory footprint, optimal operation count in the sixth- and seventh-power-scaling terms, negligible hard disk and network use, and good parallel scaling.

CCSD algorithm: data structures and integral assembly
Dealing with limited per-node memory and data traffic bandwidth is one of the cornerstones of the current algorithm design. Hence, we prioritize the minimization of data storage and hard disk/network use to increase CPU efficiency. For that purpose, our aim is to minimize the storage requirement with minor sacrifices regarding the operation count, so that all necessary quantities are available in local memory or can be effectively recomputed when needed for as large orbital spaces as possible.
Since the best way to effectively factorize or compress the doubles amplitudes and residuals on a production level is still under active development, 32 we retain their conventional four-index representation and reduce the corresponding footprint by storing only the symmetry-unique elements in packed arrays. It is worth noting that the above memory reduction techniques do not increase the arithmetic complexity of the algorithm, that is, the $O(N^6)$ scaling is retained for CCSD.
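To illustrate the idea of symmetry-packed storage, the following sketch keeps only the $a \ge b$ triangle of a doubles-type tensor exploiting $t^{ab}_{ij} = t^{ba}_{ji}$; the array layout and helper names are our own simplified assumptions, not the actual data structures of the code.

import numpy as np

def pack_doubles(t):
    """Store only the a >= b triangle of t[a,b,i,j], valid when t[a,b,i,j] == t[b,a,j,i]."""
    nv, _, no, _ = t.shape
    packed = np.empty((nv * (nv + 1) // 2, no, no))
    idx = 0
    for a in range(nv):
        for b in range(a + 1):
            packed[idx] = t[a, b]      # the full (i,j) block of this a >= b pair
            idx += 1
    return packed

def unpack_doubles(packed, nv, no):
    """Recover the full four-index tensor from the packed triangle."""
    t = np.empty((nv, nv, no, no))
    idx = 0
    for a in range(nv):
        for b in range(a + 1):
            t[a, b] = packed[idx]
            t[b, a] = packed[idx].T    # t[b,a,j,i] = t[a,b,i,j]
            idx += 1
    return t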
Compared to that, the $O(N^4)$ and $O(N^5)$ complexity of unpacking the doubles amplitudes and of the integral assembly is affordable. It is still worthwhile to decrease the number of integral assemblies within an iteration as much as possible. As the $t_1$-transformation incorporates the effect of the singles amplitudes into the Hamiltonian, all terms containing the singles amplitudes in the conventional CCSD equations are merged into terms reminiscent of the ones that appear in the coupled-cluster doubles (CCD) equations. 28 This simplification makes the algorithm more amenable to hand-optimization and parallelization. One apparent benefit is that one of the sixth-power-scaling terms can be refactorized more efficiently, leading to a small prefactor decrease in the overall CCSD algorithm. 28 Additionally, a formulation can be chosen that does not require the assembly of the three-external four-center integrals at all, while the all-virtual four-center integrals are only needed once per iteration. 28
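The integral-direct strategy rests on assembling batches of four-center integrals on the fly from the DF factors, e.g., $\sum_Q \hat{J}^Q_{ab} \hat{J}^Q_{cd}$ for a block of $ab$ pairs. A minimal sketch of such a batched assembly is given below; the batching interface and array names are illustrative assumptions only.

import numpy as np

def assemble_four_center_batches(J_vv, batch_size):
    """Yield blocks of all-virtual integrals K[ab,cd] = sum_Q J[ab,Q] J[cd,Q].

    J_vv: (n_pair, n_aux) t1-transformed DF factors, flattened over the
    composite ab index.
    """
    n_pair = J_vv.shape[0]
    for start in range(0, n_pair, batch_size):
        stop = min(start + batch_size, n_pair)
        # one matrix-matrix multiplication per batch; the full four-center
        # tensor is never stored at once
        yield start, stop, J_vv[start:stop] @ J_vv.T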

Algorithms for the CCSD residual equations
After finding a satisfactory solution for the representation of the integrals and wave function parameters, we turn our attention to the optimization of wall times within the set constraints. This is demonstrated by a detailed analysis of the individual terms.
First, we consider the supposedly most expensive $O(N^6)$-scaling term, the $A^{ab}_{ij}$ contribution of eq 12, also known as the particle-particle ladder or PPL term. The total operation count of a naive implementation of this term is $2 n_v^4 n_o^2$, where we count both the addition and multiplication operations in order to have accurate floating point operation (FLOP) counts for the performance analysis. It is well known that this operation count can be reduced by symmetrizing the corresponding amplitude and ERI tensors, 88,89 which is, however, still challenging to implement with massive parallel scaling. 10,11,19,20 Presently we take advantage of the symmetrization, as fourfold and twofold improvements can be achieved for the contraction and the four-external assembly steps, respectively. Following the DF-based formulation of ref 28, the $A^{ab}_{ij}$ term is evaluated via

$$^{(\pm)}I_{cd,ab} = I_{ab,cd} \pm I_{ab,dc}, \qquad ^{(\pm)}A^{ab}_{ij} = \frac{1}{2} \left( ^{(\pm)}I_{cd,ab} \right)^{\dagger}\, {}^{(\pm)}T_{cd,ij},$$

where superscripts $(+)$ and $(-)$ denote symmetric and antisymmetric combinations, respectively, and $\delta_{ab}$ stands for the Kronecker delta symbol. Compared to that of ref 28, our algorithm for this term, presented in Algorithm 1, includes several improvements: it is more memory economic, I/O-free, and massively parallel via an OpenMP/MPI strategy, while retaining the use of effective matrix-matrix multiplications.

Algorithm 1 Evaluation of the $A^{ab}_{ij}$ (or PPL) term.
1: for n_block = 1, (n_v(n_v + 1)/2)/n_B   // MPI and OpenMP parallel
2:   for ab = (n_block − 1) n_B + 1, n_block n_B
3:     assemble the four-external integrals $I_{ab,cd}$ for the current ab batch
...
5:     $^{(\pm)}I_{cd,ab} = I_{ab,cd} \pm I_{ab,dc}$
6:     $^{(\pm)}A^{ab}_{ij} = \frac{1}{2} (^{(\pm)}I_{cd,ab})^{\dagger}\, ^{(\pm)}T_{cd,ij}$
7:     for a ≤ b: $R_{ab,ij} \leftarrow {}^{(\pm)}A^{ab}_{ij}$
8:     for a > b: $R_{ab,ij} \leftarrow \pm\, {}^{(\pm)}A^{ba}_{ij}$
9: end for
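To make the role of the $(\pm)$ combinations concrete, the following sketch evaluates the PPL contraction via its symmetric and antisymmetric halves with dense numpy arrays. It is a simplified illustration of the symmetrized contraction idea only, with our own conventions (no DF batching, no packing of the triangles), not the production algorithm.

import numpy as np

def ppl_symmetrized(K, t2):
    """PPL term A[a,b,i,j] = sum_{cd} (ac|bd) t[c,d,i,j] via (+/-) combinations.

    K[a,b,c,d] = (ac|bd); t2[c,d,i,j]: doubles amplitudes.
    Full arrays are used for clarity; the actual saving comes from
    contracting only the packed c >= d (and a >= b) triangles.
    """
    Kp = 0.5 * (K + K.transpose(0, 1, 3, 2))    # (c,d)-symmetric part
    Km = 0.5 * (K - K.transpose(0, 1, 3, 2))    # (c,d)-antisymmetric part
    Tp = 0.5 * (t2 + t2.transpose(1, 0, 2, 3))
    Tm = 0.5 * (t2 - t2.transpose(1, 0, 2, 3))
    # the symmetric x antisymmetric cross terms vanish upon summation over c,d
    return (np.einsum('abcd,cdij->abij', Kp, Tp)
            + np.einsum('abcd,cdij->abij', Km, Tm))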
Due to the increased importance of the remaining, non-PPL $O(N^6)$-scaling terms in our target applications and the fact that these terms are mostly omitted from algorithm optimization efforts in the literature, we decided to devote more attention to these contributions. It turned out that there are parts where a clearly optimal choice did not present itself because, for instance, the low-memory, symmetry-packed route deviates from the one that is best for parallelization. For this reason, we found it worthwhile to present our considerations regarding these terms in more detail than usual, since other preferences might lead to alternative algorithmic choices, and this could contribute to a useful discussion in the CCSD literature.
The evaluations of terms depending on the same integrals and intermediates are merged in order to reduce the number of integral assembly steps by exploiting reusable intermediates.
Namely, $C^{ab}_{ij}$ is calculated together with $B^{ab}_{ij}$, and it is also beneficial to group $E^{ab}_{ij}$ and $G^{ab}_{ij}$ with the terms contributing to the singles amplitude equations ($A^a_i$, $B^a_i$, and $C^a_i$). The remaining terms, scaling as $O(N^5)$, do not require such exhaustive optimization.
It is, however, ensured that none of those terms leads to bottlenecks related to inefficient, bandwidth-bound operations when many CPU cores are used, and that the minimal memory requirement is not increased. For this reason, it is worth decreasing the number of these $O(N^5)$-scaling steps. The algorithm for these terms is shown in Algorithm 4. First, we should recognize that almost all of these terms depend on the same $\mathbf{u} \cdot \hat{\mathbf{J}}$ product. The size of this intermediate is only cubic-scaling, so the entire $u\hat{J}$ array can be stored in memory (see line 7 of Algorithm 4). The advantage of calculating $u\hat{J}$ first is that it reduces the number of such $O(N^5)$-scaling contractions within an iteration.

Algorithm 2 Evaluation of the $C^{ab}_{ij}$ and $B^{ab}_{ij}$ terms.
1: for n_block = 1, n_o/n_B   // MPI and OpenMP parallel
2:   for k = (n_block − 1) n_B + 1, n_block n_B
3:     $I_{dl,ck} = \hat{J}_{kd,P} (\hat{J}_{lc,P})^{\dagger}$
...
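The reuse of the cubic-sized $u\hat{J}$ intermediate can be illustrated as follows; this is a schematic numpy transcription with assumed index orders, and the definition of $u$ as the usual closed-shell combination $2t - t^{\mathrm{T}}$ is our assumption, not a statement about the actual code.

import numpy as np

def build_uJ(t2, J_ov):
    """Form the reusable intermediate uJ[b,k,P] = sum_{ld} u[b,k,l,d] J[l,d,P].

    t2[a,b,i,j]: doubles amplitudes; J_ov[l,d,P]: t1-transformed DF factors.
    """
    # u[b,k,l,d] = 2 t2[b,d,k,l] - t2[d,b,k,l]  (assumed convention)
    u = 2.0 * t2.transpose(0, 2, 3, 1) - t2.transpose(1, 2, 3, 0)
    # one O(N^5) contraction shared by several residual terms
    return np.einsum('bkld,ldP->bkP', u, J_ov)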

Parallelization of the residual equations
The intra-node parallelization utilizing multi-core CPUs is performed using the OpenMP application programming interface, which facilitates multi-threaded execution on a many-core processor. On top of that, the MPI protocol is employed to distribute the workload among multiple nodes of a computer cluster. It is beneficial to combine the two parallelization strategies because MPI tasks cannot access the memory space of the other processes without introducing additional communication operations, while OpenMP threads running on the same node can directly access data in a shared-memory environment. Thus, in the implemented hybrid MPI/OpenMP approach the memory of a non-uniform memory access (NUMA) node, or optionally the entire memory of a multi-processor node, is shared among the threads of the OpenMP layer, and the MPI layer distributes the workload between the NUMA or complete compute nodes.

Algorithm 3 Evaluation of the $D^{ab}_{ij}$ term.
1: for n_block = 1, n_o/n_B   // MPI and OpenMP parallel
2:   for k = (n_block − 1) n_B + 1, n_block n_B
...
4:     $L_{dl,ck} = 2 L_{dl,ck} - L_{cl,dk}$
5:   end for
...
16: end for
Since the individual terms in the CCSD residual equations are already partitioned into blocks, it is natural to spread the batches among the nodes. Accordingly, the MPI parallelization is performed by distributing the blocks of ab indices for the PPL term and the blocks of the occupied k indices for the other terms across the MPI processes. This strategy is beneficial because, as long as the key quantities, e.g., the three-center two-electron MO integrals, the amplitudes, and the residuals, fit into the memory of a single node, their inter-node communication is not needed. It is only necessary to gather the residual contributions of the processes at the end of an iteration. Alternatively, the four-index arrays and the three-center integrals could be distributed among the nodes. 19,20 This strategy decreases the memory demand of the MPI tasks running on a single node but introduces a potentially limiting communication cost. The applications of the present report were performed with the replicated memory model, since this choice was allowed by our highly memory-efficient implementation. The workload is distributed statically for each term except for the PPL term. The distribution of the blocks in the PPL term occurs dynamically, and its evaluation takes place at the end of the iteration for load balancing. This dynamic load distribution strategy also facilitates the use of a heterogeneous hardware environment; e.g., nodes with different processor types or memory amounts can also be utilized effectively up to a reasonable degree of heterogeneity.

Algorithm 4 Evaluation of the $O(N^5)$-scaling terms.
...
7:   $uJ_{bk,P} = u_{bk,ld}\, \hat{J}_{ld,P}$
...
     for k ≥ i: $R_{ad,ik} \leftarrow -G^{adk}_{i}$
...
22: end for
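A dynamic, master-worker distribution of the PPL blocks over MPI ranks, as described above, can be sketched as follows; this is a generic mpi4py illustration with made-up names and a simple request-reply protocol, not the program's actual communication scheme.

from mpi4py import MPI

def dynamic_blocks(comm, n_blocks, work):
    """Rank 0 hands out block indices on demand; workers request more work
    as they finish, which balances load on heterogeneous nodes."""
    rank, size = comm.Get_rank(), comm.Get_size()
    DONE = -1
    if rank == 0:
        next_block, active = 0, size - 1
        while active > 0:
            src = comm.recv(source=MPI.ANY_SOURCE)   # a worker asks for work
            if next_block < n_blocks:
                comm.send(next_block, dest=src)
                next_block += 1
            else:
                comm.send(DONE, dest=src)
                active -= 1
    else:
        while True:
            comm.send(rank, dest=0)                  # request the next block
            block = comm.recv(source=0)
            if block == DONE:
                break
            work(block)                              # evaluate this PPL block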
The avoidance of poorly scaling, bandwidth-bound operations also contributes to the speedup measured for a (H2O)10 cluster with the cc-pVDZ basis set on a 16-core node compared to our previous CCSD program. 53 Most of this gain can, however, be attributed to the fact that the previous implementation was not optimized to run efficiently on a large number of threads.

Integral-direct and parallel (T) algorithm
The so-called "ijkabc"-type implementations of the (T) correction usually compute the W and V intermediates of eqs 21 and 22 for all virtual orbital index triples in nested loops over fixed "ijk" triplets. In order to accumulate all contributions of W into the same three-dimensional array, the indices of the intermediates resulting from the contractions of eq 21 are usually permuted to a matching order, say "abc". We have shown previously that this permutation, despite its lower, sixth-power-scaling operation count, has a highly bandwidth-bound nature as opposed to the effective, compute-bound contractions. 56 For this reason we proposed an algorithm for multi-threaded use which exploits the permutational symmetry of the Coulomb integrals and the doubles amplitudes in order to eliminate all but one of the permutations needed. 56 Here, we report a further improved "ijkabc" (T) algorithm that eliminates the one remaining permutation, as well as any remaining disk usage. Additionally, the minimal memory requirement is compressed to be comparable to that of the new CCSD algorithm, and the code is also parallelized via a similar MPI/OpenMP route.
The updated "ijkabc" (T) algorithm is presented in Algorithm 5. To avoid the redundant unpacking of the doubles amplitudes inside the three loops over the i ≥ j ≥ k indices, it is beneficial to perform this operation outside all the occupied loops. Since the residual array from the CCSD calculation is no longer needed, the unpacked doubles amplitude tensor can be stored in a space comparable to that allocated for the CCSD calculation. With a sufficiently low memory requirement, the next task is the wall time optimization of the (T) correction because of its even steeper scaling than that of the CCSD method.

Algorithm 5 Evaluation of the (T) correction.
...
     for c = 1, n_v   // OpenMP parallel
...
15: ...
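For orientation, a schematic, unoptimized transcription of the canonical "ijkabc" triples energy accumulation is given below; it follows the standard closed-shell working equations in the form quoted in Section 2, uses dense numpy arrays, full (unrestricted) ijk loops, and assumed index orderings for clarity, and is in no way the optimized Algorithm 5.

import numpy as np
from itertools import permutations

def triples_energy(f_o, f_v, g_vvvo, g_vooo, g_vovo, t1, t2):
    """Schematic canonical closed-shell (T) energy ('ijkabc' loop ordering).

    f_o, f_v: diagonal occupied/virtual Fock elements (canonical basis)
    g_vvvo[b,d,a,i] = (bd|ai); g_vooo[c,k,j,l] = (ck|jl); g_vovo[a,i,b,j] = (ai|bj)
    t1[i,a], t2[i,j,a,b]: converged CCSD amplitudes
    """
    no, nv = t1.shape

    def w_term(i, j, k):
        # generator of W before applying the permutation operator P
        x = np.einsum('bda,cd->abc', g_vvvo[:, :, :, i], t2[k, j])
        x -= np.einsum('cl,lab->abc', g_vooo[:, k, j, :], t2[i])
        return x

    energy = 0.0
    for i in range(no):
        for j in range(no):
            for k in range(no):
                occ = (i, j, k)
                # P: the six simultaneous permutations of the (ai),(bj),(ck) pairs
                w = np.zeros((nv, nv, nv))
                for p in permutations((0, 1, 2)):
                    w += w_term(occ[p[0]], occ[p[1]], occ[p[2]]).transpose(p)
                v = (w
                     + np.einsum('bc,a->abc', g_vovo[:, j, :, k], t1[i])
                     + np.einsum('ac,b->abc', g_vovo[:, i, :, k], t1[j])
                     + np.einsum('ab,c->abc', g_vovo[:, i, :, j], t1[k]))
                d = (f_o[i] + f_o[j] + f_o[k]
                     - f_v[:, None, None] - f_v[None, :, None] - f_v[None, None, :])
                num = (4.0 * w + w.transpose(1, 2, 0) + w.transpose(2, 0, 1)) \
                      * (v - v.transpose(2, 1, 0))
                energy += np.sum(num / d)
    return energy / 3.0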

Computational details
The above CCSD(T) approach has been implemented in the Mrcc suite of quantum chemical programs and will be made available in a forthcoming release of the package. 90 The benchmark tests were carried out on compute nodes containing two Intel Xeon processors. For the benchmark calculations, correlation-consistent basis sets, cc-pVXZ, 91 were utilized with the corresponding DF auxiliary bases, cc-pVXZ-RI. 92 The calculations in Section 6 utilize triple- and quadruple-ζ valence basis sets including polarization 93 and diffuse functions 94 (def2-TZVPPD and def2-QZVPPD) with the corresponding auxiliary basis sets. 95 The core electrons were not correlated in any of the presented cases.

Performance analysis
In this section we analyze the multi-threaded and multi-node parallel performance of the introduced implementation on the test systems collected in Table 1.

Multi-threaded performance of CCSD(T)
The multi-threaded scaling of the CCSD algorithm is depicted in Figure 1. Regarding the results for CCSD (Figure 1), all five investigated implementations show an excellent speedup, mostly between 8 and 11 with 16 threads. Our implementation also demonstrates efficient CPU utilization, reaching about 65% of the theoretical peak performance with 16 threads. To quantify the effect of memory locality, first all threads and data allocations were assigned to the same NUMA node, and then the threads and the data were fixed on different NUMA nodes. The wall time measured with the data in the non-local memory was found to be about 13% longer than the one with only local memory access. However, this is clearly the limiting case and, in realistic applications, when only one MPI task is running on a node, the data are distributed between the local and non-local memory. When there is enough memory, it is advisable to run at least as many MPI processes per node as the number of NUMA nodes to avoid this slower memory access.

Multi-node performance of CCSD(T)
The multi-node performance was determined on the example of the uracil trimer for CCSD and of the (H2O)14 cluster for (T), both with the cc-pVDZ basis set. The multi-node scaling of the CCSD and (T) parts is shown in Figures 8 and 9, respectively. Although the speedup deviates from the ideal as the number of nodes grows, it is satisfactory from the perspective of the planned applications, since a few hundred compute cores can still be efficiently exploited by using up to a few tens of nodes equipped with many-core CPUs.
Further performance benchmarks are given in Section 6.2 where various MPI/OpenMP strategies are compared for large-scale examples.

Applications
In this section, we illustrate the capabilities of the presented CCSD(T) code on chemical problems that would be beyond the reach of conventional implementations. The performance and efficiency of the program are also analyzed for these large-scale calculations executed with a few hundred cores.

Expansion of the CEMS26 test set of CCSD(T) references
In accord with our first goal of providing valuable reference data for the assessment of reduced-scaling methods, we expand our previous CCSD(T) reference compilation 21 with correlation energies, reaction energies, and barrier heights obtained for real-life catalytic reactions. 81-83 The first version of the correlation energies of medium-sized systems list contains 26 entries, hence the abbreviation CEMS26. 21 First, in order to consider realistic test systems which are also suitable for assessing the accuracy of various local approximations, each molecule of CEMS26 contains at least 30 atoms. Second, all correlation energies are obtained with at least triple-ζ quality basis sets to provide some flexibility in the one-particle basis set.
Third, high reproducibility is required for these benchmarks; thus, for example, reduced-cost NO-based calculations were excluded. Together these three criteria represent a very strong set of limitations. After an extensive literature search we were only able to find a handful of suitable calculations to add to the test set. Aiming at a more representative data set size, we also performed a number of extensive CCSD(T) calculations, but a large portion of those had to exploit spatial symmetry to have an affordable computational cost. The higher-than-average portion of spatially symmetric molecules is thus not representative, which is corrected here by adding 12 asymmetric molecules of 31-43 atoms to the list. Another unfavorable feature of the test set is the relatively low number of results calculated with basis sets other than Dunning's correlation-consistent basis sets 91 or with ones including diffuse functions. To improve upon this aspect, at least triple-ζ, and for one entry quadruple-ζ, basis sets were employed including both polarization and diffuse functions (def2-TZVPPD and def2-QZVPPD). 93,94 The previous entries were added using their ground state global minimum structures. The current selections thus also include local minima and transition state structures along the previously explored reaction paths. 81-83 Three reactions are considered: an organocatalytic Michael addition, 81 the hydrogen activation by a frustrated Lewis pair (FLP), 82 and a palladium-catalyzed C-H bond activation; 83 the corresponding correlation energies are collected in Table 2. These correlation energies are also useful to expand the CEMS26 set with 8 new reaction energies and 4 barrier heights, which are collected in Table 3. For the Michael addition, the competing reaction paths are found in a fairly narrow energy range. Thus, highly accurate calculations are required for the reliable characterization of the reaction mechanism. 81 The species added to the CEMS26 test set are the OO and CB intermediates as well as the TS1 and TS2 transition states (Table 2). Finally, we consider the palladium-catalyzed C-H bond activation reaction (see Figure 12), 83 in which palladium catalyzes the cross-dehydrogenative coupling of anilides. The energies calculated for ABP with the def2-TZVPPD and def2-QZVPPD bases were also extrapolated to the basis set limit. The extrapolation was carried out utilizing the two-point expression of Karton and Martin 96 for the HF energies, using the parameters suggested by Neese and Valeev. 97 The correlation energies were extrapolated using the formula introduced by Helgaker et al. 98 As shown in Table 3, at least for this particular example, the reaction energy converges relatively rapidly with respect to the basis set size: -74.4, -73.8, and -73.5 kcal/mol were obtained with the def2-TZVPPD and def2-QZVPPD basis sets and as the result of the extrapolation, respectively. The resulting 39-entry compilation, referred to as CEMS39, is currently the most realistic test set aimed at the representative assessment of local correlation methods. We will employ CEMS39 for that purpose in a forthcoming publication in the context of our LNO-CCSD(T) method. 21,57
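For reference, the two-point extrapolations mentioned above can be carried out as sketched below; this is a generic transcription in which the exponent alpha is left as a parameter, since the value recommended in ref 97 for the basis family at hand should be taken from that work.

import math

def cbs_corr(e_corr_x, e_corr_y, x, y):
    """Two-point X^-3 extrapolation of correlation energies (Helgaker et al.)."""
    return (x**3 * e_corr_x - y**3 * e_corr_y) / (x**3 - y**3)

def cbs_hf(e_hf_x, e_hf_y, x, y, alpha):
    """Two-point extrapolation of HF energies assuming
    E(X) = E_CBS + A exp(-alpha sqrt(X)); alpha is basis-family dependent."""
    bx = math.exp(-alpha * math.sqrt(x))
    by = math.exp(-alpha * math.sqrt(y))
    return (e_hf_x * by - e_hf_y * bx) / (by - bx)

# usage with cardinal numbers X = 3 (TZ) and Y = 4 (QZ):
# e_cbs = cbs_hf(e_hf_tz, e_hf_qz, 3, 4, alpha) + cbs_corr(e_c_tz, e_c_qz, 3, 4)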

Performance for large-scale examples
To characterize the efficiency of the program also for large examples, we employed various settings for the number of MPI processes and OpenMP threads as well as for the parallelism of the inner parallel region (threaded or sequential BLAS library). We measured the efficiency of these calculations relative to the theoretical peak performance of the 4 Intel Xeon Platinum 8180M processors utilized in these numerical experiments. For this particular CPU the theoretical peak performance is calculated with the 1.7 GHz base frequency, which is the limit when the AVX-512 instruction set and all cores are employed. The measurements are summarized in Table 4.

We also improved upon a previous t_1-transformed CCSD algorithm, 28 for instance, by optimizing and parallelizing all contractions besides the usually emphasized particle-particle ladder term. As the highest possible permutational symmetry and lowest operation count are ensured for the PPL term, some of which is sacrificed in alternative parallel CCSD implementations, 10,11,19,20 the remaining four O(N^6)-scaling terms are found to be comparably time-consuming in some of our target applications. Thus, hand-optimized and well-scaling algorithms are presented also for those terms appearing in the t_1-transformed CCSD equations. Due to the balanced performance obtained also for relatively small systems, appearing also in popular data sets used for parametrizing DFT functionals 59,60 or machine-learning models, 61-69 and due to the minimal memory and disk footprint, we believe that the present code is well suited to produce such large-scale benchmark CCSD(T) data. These properties are especially useful if only a network file system and limited per-core memory are available, as in many current computer clusters, allowing for the almost independent evaluation of hundreds of such calculations.

Acknowledgments

This work was supported by the BME-Biotechnology FIKP grant of EMMI (BME FIKP-BIO). The work of PRN is supported by the ÚNKP-18-4-BME-257 and ÚNKP-19-4-BME-418 New National Excellence Program.

Supporting Information Available
See the supporting information for the complete list of Cartesian coordinates employed for the CEMS39 set and for the computed HF, MP2, CCSD, and CCSD(T) energies.
This material is available free of charge via the Internet at http://pubs.acs.org/.
The $t_1$-transformed Fock matrix is, in general, not symmetric, i.e., $\hat{f}_{pq} \neq \hat{f}_{qp}$. Furthermore, the occupied-virtual block of $\hat{\mathbf{J}}$ is invariant to the $t_1$-transformation, 76 i.e., $\hat{J}^P_{ia} = J^P_{ia}$. The transformation of the individual blocks of $\mathbf{J}$ and $\mathbf{f}$ can be carried out according to the following equations; for $\mathbf{J}$ they read

$$\hat{J}^P_{ij} = J^P_{ij} + \sum_b t^b_j J^P_{ib}, \qquad \hat{J}^P_{ab} = J^P_{ab} - \sum_j t^a_j J^P_{jb},$$
$$\hat{J}^P_{ai} = J^P_{ai} + \sum_b t^b_i J^P_{ab} - \sum_j t^a_j J^P_{ji} - \sum_{jb} t^a_j t^b_i J^P_{jb}.$$

In contrast to the approach of ref 28, here the $t_1$-transformation is performed in the MO basis because in this case the three-center AO integrals do not have to be stored during the CCSD iteration. Let us also note that the auxiliary basis required for the correlated calculation is employed for the three-center integrals; thus, our $t_1$-transformed expressions yield exactly the same numerical results as a DF-CC implementation without $t_1$-transformation.
In that respect we also deviate from the algorithm of ref 28, where, to our understanding, the auxiliary basis of the SCF calculation is employed to form the t_1-transformed Fock matrix.
In order to save memory space during the CCSD iterations, we do not store the original MO integrals because they can be recovered from the t_1-transformed integrals via the inverse transformation, i.e., by inverting the transformation defined by eqs 32-35.
For example, the original $J^P_{ij}$ integrals can be recovered as

$$J^P_{ij} = \hat{J}^P_{ij} - \sum_b t^b_j \hat{J}^P_{ib},$$

exploiting that $\hat{J}^P_{ib} = J^P_{ib}$.
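A compact numpy illustration of the forward and inverse $t_1$-transformation of the occupied-occupied block is given below; the array shapes and names are illustrative assumptions consistent with the equations above.

import numpy as np

def t1_dress_oo(J_oo, J_ov, t1):
    """Forward transformation: Jhat[i,j,P] = J[i,j,P] + sum_b t1[j,b] J[i,b,P]."""
    return J_oo + np.einsum('jb,ibP->ijP', t1, J_ov)

def t1_undress_oo(Jhat_oo, J_ov, t1):
    """Inverse transformation, exploiting that the ov block is invariant."""
    return Jhat_oo - np.einsum('jb,ibP->ijP', t1, J_ov)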