Algorithms for Tensor Network Contraction Ordering

Contracting tensor networks is often computationally demanding. Well-designed contraction sequences can dramatically reduce the contraction cost. We apply simulated annealing and genetic algorithms, two common discrete optimization techniques, to this ordering problem. We benchmark their performance as well as that of the commonly-used greedy search on physically relevant tensor networks. Where computationally feasible, we also compare them with the optimal contraction sequence obtained by an exhaustive search. We find that the algorithms we consider consistently outperform a greedy search given equal computational resources, with an advantage that scales with tensor network size. We compare the obtained contraction sequences and identify signs of highly non-local optimization, with the more sophisticated algorithms sacrificing run-time early in the contraction for better overall performance.


I. INTRODUCTION
Tensor networks are a convenient language for studying the statistics of discrete systems with local interactions. The partition function and correlation functions of many lattice models may be written as tensor networks. Similarly, typical states of quantum systems often admit an efficient representation as a tensor network, either in the form of matrix product states (MPS) 1,2 or more general states such as tree tensor networks 3,4 and projected entangled pair states (PEPS) 5 . Tensor networks have also been used as machine learning classifiers 6,7 .
At the core of these applications is the problem of tensor network contraction, in which all intermediate bonds in a tensor network are summed to evaluate the network. Because these sums are performed simultaneously, a naive tensor network contraction uses computational resources which are exponential in system size, and so better approaches are needed.
Unfortunately, the problem of contracting tensor networks lies in the computational complexity class NP 8 and so it is likely that in the worst case exponential resources will always be needed. Nonetheless, it is often possible to approximate tensor network contraction, resulting in efficiently-computable answers with controllable errors 5,9-14 . Moreover, many useful tensor networks, including MPS networks 1 , can be contracted exactly in polynomial time by taking advantage of the property that the computational cost of contracting a tensor network depends strongly on the order of summation while the result does not. Hence while the worst cases may be intractable, there is still room to improve in typical or special cases.
To that end, we examine two algorithms which are widely used in discrete optimization, namely Genetic Algorithms 15,16 and Simulated Annealing 17,18 . Our aim is to see if these algorithms provide any improvement over standard methods and hand-crafted contraction sequences. We begin in Section II by reviewing the structure of tensor networks, Penrose notation, and the computational cost of contraction sequences. In Section III we then describe the algorithms in more detail, along with the commonly-used Greedy Search algorithm 19 and the reference Exhaustive Search method 20-23 . We then perform numerical experiments in Section IV, testing these methods on both two-dimensional square tensor networks and Erdős-Rényi random networks, and find that both algorithms outperform the Greedy Search in most of our experiments, often by many orders of magnitude. We examine specific contraction sequences in Section V to understand how these algorithms craft such efficient sequences, and conclude with a discussion of our results in Section VI.

II. TENSOR NETWORK CONTRACTION
A tensor network is a list of tensors along with a specification of which pairs of their indices are meant to be contracted. So for instance,

N_{klmn} = \sum_{i,j} T_{ijk} X_{il} Y_{jmn} ,   (1)

specifies a network N formed of three tensors T , X and Y , with two contracted pairs of indices, namely i and j. The network itself is tensor-valued, with the four indices k, l, m and n, which correspond to the indices of the constituent tensors which were not contracted.
A key feature of tensor network contraction is that individual summations commute. That is, the sums in equation (1) may be done simultaneously, but we could also perform first the sum over i, producing the intermediate tensor

Q_{jkl} = \sum_i T_{ijk} X_{il} ,   (2)

and only then perform the sum over j to evaluate

N_{klmn} = \sum_j Q_{jkl} Y_{jmn} .   (3)

Likewise, we could first sum over j, producing

\tilde{Q}_{ikmn} = \sum_j T_{ijk} Y_{jmn} ,   (4)

and then sum over i to obtain

N_{klmn} = \sum_i \tilde{Q}_{ikmn} X_{il} .   (5)

Both pathways arrive at the same answer, but they may have very different computational costs. For instance, the intermediate \tilde{Q} has a higher rank (number of indices) than the intermediate Q, and so if all bonds have the same dimension, \tilde{Q} requires more memory to store and more computation time to evaluate. It is often convenient to write small tensor networks explicitly, as in equation (1), but for large ones this quickly becomes cumbersome. Instead we depict larger networks graphically using Penrose notation, with squares representing tensors and lines representing indices 24 . So, for instance, the network specified by the right-hand side of equation (1) is shown graphically in Figure 1.
In this notation, performing a single sum amounts to combining two nodes in the graph into one. Hence, summing over j in equation (1) produces the network shown in Figure 2. Then, summing over i finally yields the evaluated network shown in Figure 3.
We call the order in which we contract pairs of indices a contraction sequence. To calculate the computational cost of a given contraction sequence, we count the number of floating-point multiplications that have to be performed 20 . This is equal to the number of floating-point additions, and so counts the number of operations required and the run-time up to a constant factor. Our cost function is then

cost({E}) = \sum_{e \in \{E\}} \prod_{m \in \{v_e\}} χ(m) ,   (6)

where {E} denotes the ordered set of edges to be contracted, {v_e} denotes the set of edges adjoining the two vertices connected by the edge e (including e itself) at a given contraction step, and χ(m) is the bond dimension of edge m, i.e., the number of different values that the index associated with m can assume. Holding the coordination number fixed, the computational complexity of evaluating this cost function is O(E), where E is the number of edges in the network.
As an example, consider the contraction in equation (1). Contracting the index i first amounts to χ(i)χ(j)χ(k)χ(l) elementary operations. Following up with the sum over the index j then adds another χ(j)χ(k)χ(l)χ(m)χ(n) operations. The total cost of this contraction ordering is therefore

cost_{ij} = χ(i)χ(j)χ(k)χ(l) + χ(j)χ(k)χ(l)χ(m)χ(n) .   (7)

By contrast, the cost of contracting first j and then i is

cost_{ji} = χ(i)χ(j)χ(k)χ(m)χ(n) + χ(i)χ(k)χ(l)χ(m)χ(n) .   (8)

These are clearly different, and so there is room to optimize by picking the lower-cost option.
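To make this concrete, the following minimal Python sketch evaluates equation (6) for a given contraction sequence. The list-of-index-sets representation and the function name contraction_cost are illustrative choices of ours, not part of any library; the example network is that of equation (1) with all bond dimensions set to χ = 2.

```python
# A minimal sketch of the cost function in equation (6). A network is given
# as a list of index sets (one per tensor) plus a dict of bond dimensions.

def contraction_cost(tensors, dims, sequence):
    """Total multiplication count of contracting `sequence`
    (an ordered list of edge labels), following equation (6)."""
    tensors = [set(t) for t in tensors]  # work on a copy
    total = 0
    for edge in sequence:
        a, b = [t for t in tensors if edge in t]  # the two tensors sharing `edge`
        cost = 1
        for m in a | b:                           # all edges adjoining both vertices
            cost *= dims[m]
        total += cost
        tensors.remove(a)
        tensors.remove(b)
        tensors.append((a | b) - (a & b))         # merged tensor; shared bonds summed
    return total

# The example network of equation (1): T_{ijk}, X_{il}, Y_{jmn}, all bonds χ = 2.
net = [{"i", "j", "k"}, {"i", "l"}, {"j", "m", "n"}]
chi = {m: 2 for m in "ijklmn"}
print(contraction_cost(net, chi, ["i", "j"]))  # cost_ij = 2^4 + 2^5 = 48
print(contraction_cost(net, chi, ["j", "i"]))  # cost_ji = 2^5 + 2^5 = 64
```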

III. ALGORITHMS
We have tested four algorithms. Two of these, namely Exhaustive Search and Greedy Search, are in common use for obtaining tensor network contraction sequences. So far as we are aware, the other two have not previously been used to this end.
The other two, namely Simulated Annealing and the Genetic Algorithm, share a common structure which will aid in their comparison. Each consists of a procedure for generating batches of contraction sequences. Only one batch is considered at a time. The cost of contracting the given tensor network with each sequence in the batch is then computed. If any sequence in the batch has a lower cost than the previous lowest-cost sequence, it is stored in place of the previous best sequence. The algorithm then proposes a new batch, possibly using the results of the previous batches. This process iterates until either all possible orderings have been considered or a time limit is reached. Where these methods differ is in the rule for producing new batches. Because of this structure, both methods may be run for as long as desired and can at any point return the best ordering found so far. We take advantage of this to restrict each method to a fixed number of evaluations of the cost function. This limitation is a proxy for a run-time limit that is insensitive to details of the implementation of the algorithm or the underlying hardware, which makes it a useful means of comparison.
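Schematically, this shared structure might be written as follows. The helper propose_batch is a hypothetical stand-in for the method-specific rule for generating new batches, and cost is the cost function of equation (6) applied to the network at hand.

```python
# A schematic of the shared batched-search structure: propose a batch,
# score every sequence in it, keep the best sequence seen so far, and
# repeat until the evaluation budget is spent.

def batched_search(propose_batch, cost, budget):
    best_seq, best_cost = None, float("inf")
    evaluations = 0
    history = []                            # earlier batches, for methods that use them
    while evaluations < budget:
        batch = propose_batch(history)      # method-specific proposal rule
        for seq in batch:
            c = cost(seq)
            evaluations += 1
            if c < best_cost:
                best_seq, best_cost = seq, c
        history.append(batch)
    return best_seq, best_cost
```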
We now detail the four algorithms. Implementations for the Greedy Search, the Genetic Algorithm, and Simulated Annealing can be found at github.com/frankschindler/OptimizedTensorContraction.

A. Exhaustive Search
Exhaustive Search comes in several varieties. In its most basic version every possible contraction ordering is considered exactly once. The cost of each is evaluated and the ordering with the lowest cost is returned.
This algorithm is deterministic and always returns the optimal contraction sequence. Because the number of contraction sequences to consider scales like O(e^E), where again E is the number of edges in the network, the run-time of this algorithm is exponential. More advanced variants of this algorithm incorporate tree pruning 20 to avoid considering sequences which can be proven to have higher cost than others, but in the worst case the cost is still exponential.
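For concreteness, the basic variant can be sketched in a few lines of Python. This is an illustrative brute-force enumeration, not the pruned Netcon implementation used for our numerical results; cost is again the sequence cost function of equation (6).

```python
# A brute-force sketch of the basic Exhaustive Search: enumerate every
# permutation of the contractible edges and keep the cheapest. The
# factorial enumeration makes the exponential scaling explicit.
from itertools import permutations

def exhaustive_search(edges, cost):
    best_seq, best_cost = None, float("inf")
    for seq in permutations(edges):
        c = cost(list(seq))
        if c < best_cost:
            best_seq, best_cost = seq, c
    return best_seq, best_cost
```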
For the numerical results we present for the Exhaustive Search, we adapted the MATLAB version of the Netcon algorithm from Ref. 20 to also output the accumulated number of cost function evaluations. We then ran it with the parameter choice costType = 1, muCap = 1, allowOPs = false. We used the MATLAB version R2019a and Netcon version 2.01.

B. Greedy Search
Greedy Search begins by considering the cost of performing just one step in the contraction. Evaluating this incremental cost takes time which is O(1). Each possible first step is considered, and the one with the lowest cost is taken. The method proceeds recursively, considering next all possible second steps.
Alternate variants of Greedy Search have been used which consider multiple steps simultaneously 19,20 . For instance one could consider all possibilities for the next two steps, or more generally for the next k steps. Running this algorithm with k simultaneous steps takes O(E^k) incremental cost function evaluations, which is equivalent to O(E^{k-1}) evaluations of the full cost function. Because the cost grows rapidly with k we only consider the commonly-used 19 case of k = 2 in the following.
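A minimal sketch of the k = 1 variant is given below, reusing the illustrative list-of-index-sets representation from Section II (our own choice, not a library interface); the k = 2 version we actually use additionally scores all pairs of consecutive steps.

```python
# A minimal one-step (k = 1) Greedy Search: at each stage, contract the
# edge with the lowest incremental cost, then recompute the candidates.

def greedy_sequence(tensors, dims):
    tensors = [set(t) for t in tensors]

    def contractible():
        # edges appearing in two tensors can be contracted
        seen, twice = set(), set()
        for t in tensors:
            twice |= t & seen
            seen |= t
        return twice

    sequence = []
    edges = contractible()
    while edges:
        def step_cost(e):
            a, b = [t for t in tensors if e in t]
            cost = 1
            for m in a | b:
                cost *= dims[m]
            return cost

        best = min(edges, key=step_cost)           # cheapest next step
        a, b = [t for t in tensors if best in t]
        tensors.remove(a)
        tensors.remove(b)
        tensors.append((a | b) - (a & b))          # merge; shared bonds summed
        sequence.append(best)
        edges = contractible()
    return sequence
```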
For the numerical results we present for the Greedy Search and the remaining algorithms, we made use of Python 3.7.4, with the libraries Numpy 1.17.2, Scipy 1.3.1, itertools, and copy. We implemented the k-step Greedy Search as a standalone Python function built on these tools.

C. Genetic Algorithm
The Genetic Algorithm begins by evaluating the fitness (the negative cost) of each contraction sequence in a starting population of randomly generated sequences 15,16 . It then samples a new population, drawing from the starting population with replacement and with probabilities that are proportional to the fitness of the individual contraction sequences. This models the extinction of unfit specimens. Furthermore, the contractions in the new population are subject to mutation, in that there is a finite chance that the positions of two randomly selected edges are exchanged in the respective sequence (the fittest individual of the population is always kept unchanged). This process is then iterated.
We implemented the Genetic Algorithm in Python. We chose a population size of 20 and a mutation rate of 60%. For the fitness function, we used

f({E}) = \exp\left( \frac{\ln[ cost({E_max}) / cost({E}) ]}{\ln[ cost({E_max}) / cost({E_min}) ]} \right) - 0.99 ,   (9)

where {E_min} and {E_max} are the contraction sequences with the lowest and highest cost in the population, respectively. This fitness function was chosen heuristically to return values in the range (0, e - 1) while retaining the hierarchy of scales resolved by the original cost function (up to a power). Note that the subtraction of 0.99 matters because we generate probabilities from a population's fitness distribution after normalization. We checked that the performance of the Genetic Algorithm is not sensitive to the precise choice of fitness function.
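One generation of this procedure can be sketched as follows, assuming the fitness of equation (9) as written above. The population size, mutation rate, elitism, and swap mutation follow the description in the text; the function names are our own illustrative choices.

```python
# A compact sketch of one generation of the Genetic Algorithm: sequences
# are resampled with probability proportional to fitness, the fittest
# individual is kept unchanged, and mutation swaps two edge positions.
import numpy as np

def fitness(costs):
    """Fitness of equation (9): values in roughly (0, e - 1), preserving
    the hierarchy of cost scales up to a power."""
    log_c = np.log(np.asarray(costs, dtype=float))
    span = log_c.max() - log_c.min()
    if span == 0:
        return np.ones_like(log_c)
    return np.exp((log_c.max() - log_c) / span) - 0.99

def next_generation(population, costs, mutation_rate=0.6, rng=np.random):
    f = fitness(costs)
    probs = f / f.sum()
    elite = list(population[int(np.argmin(costs))])
    children = [elite]                        # fittest individual kept unchanged
    while len(children) < len(population):
        parent = population[rng.choice(len(population), p=probs)]
        child = list(parent)
        if rng.random() < mutation_rate:      # swap two randomly selected edges
            a, b = rng.choice(len(child), size=2, replace=False)
            child[a], child[b] = child[b], child[a]
        children.append(child)
    return children
```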

D. Simulated Annealing
Simulated Annealing works with an alternative representation of contraction sequences, in which we encode permutations of edge labels by arrays of real numbers taken from the interval [0, 1]. A contraction sequence can then be obtained from the permutation that orders the numbers in the respective array by magnitude. This representation has the advantage that it allows for a continuous deformation of the arrays, while the constraint that each array represent a valid permutation is implicitly taken into account. This allows us to use the dual annealing 25 variant, which combines the standard classical annealing algorithm with a local optimizing search.
For the numerical results we present for Simulated Annealing, we employed the implementation of the dual annealing algorithm that is available from the optimize package of the Scipy library.
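A minimal sketch of this encoding is shown below, reusing the illustrative contraction_cost function from Section II. The argsort decoding implements the ordering-by-magnitude described above, and dual_annealing is the Scipy routine we employ; the wrapper itself is our own illustration, not our production code.

```python
# A sketch of the real-vector encoding: each point x in [0, 1]^E is decoded
# into a permutation of the E contractible edges via argsort, and Scipy's
# dual annealing minimizes the resulting contraction cost.
import numpy as np
from scipy.optimize import dual_annealing

def optimize_sequence(tensors, dims, edges, maxiter=500):
    edges = list(edges)

    def objective(x):
        order = np.argsort(x)                 # real vector -> permutation
        seq = [edges[i] for i in order]
        return float(contraction_cost(tensors, dims, seq))

    bounds = [(0.0, 1.0)] * len(edges)
    result = dual_annealing(objective, bounds, maxiter=maxiter)
    return [edges[i] for i in np.argsort(result.x)], result.fun
```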

IV. NUMERICAL EXPERIMENTS
We perform our numerical experiments on two classes of tensor networks.
The first are two-dimensional square lattices, shown in Figure 4 (a). We choose two-dimensional networks because in one dimension the optimal contraction sequence is already known, so this provides one of the simplest non-trivial test cases.
The second class of network we consider is that of Erdős-Rényi random graphs. These consist of a collection of nodes with edges distributed amongst them at random, such that all pairs of nodes have the same probability of having an edge, and such that edges are placed independently of one another. In our tests we let this probability be 80%. An example of a tensor network generated in this way is shown in Figure 4 (b). Such networks are analogous to spin glasses, and represent some of the most difficult tensor networks to contract due to their large connectivity and high variance in tensor rank.
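Such a network can be generated in a few lines, as sketched below; the representation matches the illustrative one used in Section II, and the helper name is our own. We assume edge probability p = 0.8 and a uniform bond dimension.

```python
# A sketch of Erdős-Rényi tensor network generation: each pair of nodes
# independently receives a contractible edge with probability p.
import itertools
import numpy as np

def random_network(n_nodes, p=0.8, chi=2, rng=np.random):
    tensors = [set() for _ in range(n_nodes)]
    dims = {}
    for a, b in itertools.combinations(range(n_nodes), 2):
        if rng.random() < p:
            edge = f"e{a}_{b}"
            tensors[a].add(edge)
            tensors[b].add(edge)
            dims[edge] = chi
    return tensors, dims
```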

A. Variable Run-Time
In our first experiment we consider the square network shown in Figure 4 (a); the results are shown in Figure 5 (a). For Simulated Annealing and the Genetic Algorithm we show the contraction cost given by equation (6) of the best contraction sequence found as a function of the number of cost function evaluations used. The Greedy Search requires a fixed number of evaluations, and so we simply show its output at that number of evaluations.
From this experiment we see that Simulated Annealing significantly outperforms the Genetic Algorithm when given the same number of function evaluations. This is not universally true, but we see the same in almost every case. We also see that the Greedy Search performs better than the Genetic Algorithm, but worse than Simulated Annealing at the same number of function evaluations. Figure 5 (b) shows the same experiment but with an increased bond dimension of χ = 10. Increasing the bond dimension dramatically raises the contraction cost for each algorithm. The change in cost is of order (χ_new/χ_old)^{2L}, where L is the linear size of the network. This may be seen by noting that each tensor at an intermediate stage represents a contiguous subset of the original network. Eventually those subsets come to be extensive in size and so come to have perimeter length of order L. Hence at some point each algorithm must contract two tensors with of order L bonds, at a cost of order χ^{2L}.
Interestingly, with larger bond dimension Greedy Search performs worse relative to the other algorithms. We understand this as follows: for small bond dimensions it is possible for many contraction steps to matter in the total cost, because the difference between contractions of different ranks is small. As the bond dimension increases the cost of a contraction sequence comes to be dominated by the cost of the few most expensive contraction step(s). Optimizing a contraction sequence then becomes mostly a matter of avoiding the worst cases. Because the appearance of very expensive contraction steps is a function of the entire contraction sequence up to that point, this is a non-local optimization problem that Simulated Annealing and the Genetic Algorithm are better suited to.
We next repeat these experiments for the Erdős-Rényi random graph shown in Figure 4 (b). Common to all of the panels of Figure 5, we see that Simulated Annealing and the Genetic Algorithm improve in bursts, separated by long plateaus. This makes it difficult to arrive at strong statements about the correct number of function evaluations to use with these algorithms, as there is no clear indication of whether the search will continue improving or not.
The large, order-of-magnitude nature of the bursts, however, suggests a heuristic to use in practice, which is that the search for better contraction sequences should be conducted for a time comparable to the run-time of the current best contraction sequence. We call this the "time-remaining" heuristic. In that way the search at most doubles the run-time if it yields no improvement, while still offering a chance of dramatic gains.

B. Variable System Size

We next fix the number of cost function evaluations and vary the size of the square network; the results for bond dimension χ = 2 are shown in Figure 6 (a). We see relatively little variation in performance across these four algorithms, and up to the sizes where we were able to use the Exhaustive Search the algorithms perform close to the global optimum. To the extent that there is a difference, it is for larger systems, for which Simulated Annealing significantly outperforms the other algorithms.
Interestingly, the relative standard deviation in cost is much larger with both Greedy Search and the Genetic Algorithm. We are not sure why the Genetic Algorithm has a high relative standard deviation. The high relative standard deviation of the Greedy Search is understandable, however: many choices are degenerate for the Greedy Search. Often a tensor network has many different contractions with the same immediate cost, and the same is true at higher search depths. Even though the immediate cost is degenerate, the long-term consequences of these choices on the network may be radically different, causing significant variance in total cost.
Figure 6 (b) shows the same experiment but with bond dimension χ = 10. We now see a larger spread between the algorithms. Here we have highlighted the so-called "desktop limit" of cost = 10^16, which provides a rough bound on the cost of contractions that can reasonably be performed on a modern desktop computer limited to a day of runtime. (Note that if the cost is dominated by a single expensive contraction step the practical limit is somewhat lower, as tensors with 10^16 elements are unlikely to fit into memory.) In particular, the gap between Greedy Search and Simulated Annealing is such that the former hits the desktop limit on systems roughly 10% smaller than the latter, suggesting that with the improvements offered by Simulated Annealing it should be possible to contract larger tensor networks than were previously possible.
The larger bond dimension also brings an increased relative standard deviation in performance. This comes about because the cost is now more sensitive to the few most expensive contraction steps, and so becomes more sensitive to the (discrete) ranks of tensors as the contraction proceeds.

We next repeated these experiments with Erdős-Rényi random graphs of varying size. The results are shown in Figure 7. For small systems the algorithms all find nearly-optimal contraction sequences. As the system size increases above 10 to 11 nodes a large difference emerges, which grows until the Greedy Search performs a factor of 10 to 100 worse than Simulated Annealing, which in turn performs a factor of 10 or so worse than optimal. The relative standard deviation in performance across runs is generally larger than in the square network cases, particularly for Simulated Annealing. The overall increase can be understood as being due to increased variance in tensor ranks making the cost more sensitive to the precise contraction sequence. We are not sure why this affects Simulated Annealing more than the other algorithms, though it may indicate that with random graphs the dual annealing implementation is less able to exploit the structure of the network in its local search steps.

V. CONTRACTION SEQUENCES
To understand how Simulated Annealing comes to outperform the Greedy Search it is useful to examine a typical contraction sequence produced by each algorithm. In Figure 8 we show the median-cost contraction sequence found by each algorithm across 40 runs on a square network. The key difference is that Simulated Annealing retains comparatively cheap contraction options until late in the sequence, whereas Greedy Search exhausts them early. Arranging to retain lower-cost options is inherently a global optimization process, because the cost of contracting an edge is strongly dependent on the stage at which it is contracted and on the contraction order leading up to that point. This explains why Simulated Annealing is able to achieve this task while Greedy Search is not: the early costlier contractions that Simulated Annealing performs act to reduce the cost of the most expensive steps towards the end.
A hint as to how this is achieved is provided by the Simulated Annealing sequence between steps 26 and 29, and again between steps 32 and 43. In both cases, one or more bonds emerge whose contraction is expensive. In the first instance Simulated Annealing performs the expensive contraction, which enables lower-cost options afterwards. In the second instance it merges nearby nodes into those adjacent to the expensive bond (e.g. steps 35 and 36). In doing so it produces self-loops on the adjacent nodes which, upon elimination, reduce the cost of the deferred expensive step.
We next turn to the best contraction sequence produced by each algorithm. In Figure 10 (a)-(c) we show the best contraction sequence of each algorithm taken across 40 runs for the same network as in Figure 8. We also show for comparison a typical hand-crafted contraction sequence similar to the corner transfer matrix method commonly used with PEPS. In this sequence rows are contracted together repeatedly until just one remains, at which point that row is contracted down to a point. We see that both Greedy Search and Simulated Annealing significantly outperform the hand-crafted sequence. Both algorithms manage this by producing fewer high-rank tensors, which is enabled by "contracting inwards" from the perimeter rather than working along a single edge the whole time. Doing so reduces the number of extra bonds accumulated by each tensor on the edge, which holds the rank down.
In Figure 10 (d) we see that, like in the median case, there are just a few steps which together dominate the cost of the best contraction sequence for each algorithm. However, unlike in the median case, in the best case the costs of the different contraction sequences are of the same order of magnitude, which is a factor of 10 or so better than the median case for the Simulated Annealing algorithm. This suggests that the best case for the Greedy Search is a bigger improvement over the median case than the best case for the Simulated Annealing algorithm is over its median case.
We can understand this improvement by noting that in the early stages of the Greedy contraction sequence there is significant degeneracy between the various least-expensive contractions. This results in a wide variety of different possible states following the first 10 or so contraction steps. The typical such state is evidently much harder to continue contracting than the best such state. This conclusion highlights the importance of optimizing globally over the whole contraction sequence, and not just locally as Greedy Search does.
Finally, in Figure 11 we show the best contraction sequence of Greedy Search and Simulated Annealing across 40 runs for an Erdős-Rényi random graph. We see again that Simulated Annealing does a much better job of preserving comparatively good options throughout the contraction, while Greedy Search exhausts its cheap contraction options early on and is forced into a run of very expensive contraction steps. Unfortunately the random structure of this graph exacerbates our earlier challenge in interpreting the Simulated Annealing contraction sequence, and it is not clear exactly what choices it is making that enable such good long-run performance. Nevertheless, the performance is remarkable: Simulated Annealing finds a contraction sequence that is 100 times faster than that of Greedy Search, and does so with the same number of cost function evaluations.

VI. CONCLUSIONS
We have optimized tensor network contraction sequences using four different algorithms, namely Exhaustive Search, Greedy Search, Simulated Annealing, and a Genetic Algorithm. The first two of these are commonly used in contracting tensor networks, while to our knowledge the latter two have not been used in this domain. We find that Simulated Annealing significantly outperforms Greedy Search, both in the best case and on average. In many cases the cost of the contraction sequence found by Simulated Annealing is orders of magnitude lower than that of Greedy Search with a comparable amount of search time. This advantage grows larger with network size, and is most notable on networks with structure such as the square lattice.
With additional search time Simulated Annealing performs even better, often by a large enough margin to justify the extra time spent optimizing the contraction sequence. This suggests a potential strategy to use in practice, our "time-remaining" heuristic, in which one optimizes the contraction sequence until the time spent optimizing is comparable to the cost of the current best known contraction sequence.
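As a rough illustration, the loop below sketches this heuristic. The helpers search_step (which runs one more batch of the sequence search and returns the best cost found so far) and seconds_per_op (which converts the abstract cost of equation (6) into an estimated contraction run-time) are hypothetical stand-ins.

```python
# A sketch of the "time-remaining" heuristic: keep optimizing while the
# time spent searching is still small compared to the estimated run-time
# of the best contraction sequence found so far.
import time

def optimize_until_worthwhile(search_step, seconds_per_op):
    start = time.time()
    best_cost = search_step()
    while time.time() - start < best_cost * seconds_per_op:
        best_cost = min(best_cost, search_step())
    return best_cost
```

In the worst case this at most doubles the total run-time, while leaving open the possibility of the order-of-magnitude improvements seen in our experiments.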
While contracting large tensor networks generally requires approximation, our results suggest that exact contraction may be viable for larger networks than previously thought. This may provide benefits even for approximate contraction methods such as Tensor Network Renormalization, which often rely on repeatedly and exactly contracting moderate-sized networks to produce inputs into the approximation scheme 26 .
Unfortunately we have been unable to extract any further intuition from these contraction sequences. They do not appear to lead to design principles we may use to craft custom sequences for particular classes of networks.
In practice, however, this may not matter: algorithmically-generated contraction sequences may well suffice, particularly if they are more efficient than hand-crafted or heuristically-guided ones.