Inclusion-Exclusion Principle for Open Quantum Systems with Bosonic Bath

We present two fast algorithms which apply inclusion-exclusion principle to sum over the bosonic diagrams in bare diagrammatic quantum Monte Carlo (dQMC) and inchworm Monte Carlo method, respectively. In the case of inchworm Monte Carlo, the proposed fast algorithm gives an extension to the work ["Inclusion-exclusion principle for many-body diagrammatics", Phys. Rev. B, 98:115152, 2018] from fermionic to bosonic systems. We prove that the proposed fast algorithms reduce the computational complexity from double factorial to exponential. Numerical experiments are carried out to verify the theoretical results and to compare the efficiency of the methods.


Introduction
Open quantum systems, which characterize quantum systems coupled with environment, have been studied extensively for many decades, as it arises in many context including quantum optics [9], quantum computation [31], and dynamical mean field theory [20], just to list a few. The coupling between the system and the environment leads to non-Markovian evolution of the quantum state of the system. In the weak coupling limit, such evolution can be approximated by the Markovian process described by the Lindblad equation [16,17], which simplifies the numerical simulation. In the more challenging case where memory effect has to be taken into account, a number of numerical methods have been proposed in the literature. For example, the quasi-adiabatic propagator path integral (QuAPI) [25,26] method assumes finite memory length and so that the path integral can be numerically computed iteratively; by assuming that the bath response function has a special form, the hierarchical equations of motion can be applied [38,39]; the method of multiconfiguration time dependent Hartree (MCTDH) [3] is developed based on ansatz of wave functions. While these deterministic methods require some additional modeling of the open quantum system, the bare diagrammatic quantum Monte Carlo (dQMC) method [22] applies Monte Carlo sampling to directly compute the summations and high-dimensional integrals in the Dyson series expansion of the quantum observable [37], and after applying Wick's theorem [30], this approach can be represented as the summation of all possible diagrams, each of which is determined by a finite time sequences and a partition of them into pairs. However, such technique may encounter the notorious numerical sign problem [10][11][12], meaning that the number of Monte Carlo samples is required to grow at least exponentially (with respect to physical time) in order to keep the accuracy of the simulation.
Recently, the inchworm Monte Carlo method [1, 12-14, 18, 32] was proposed to mitigate the numerical sign problem. It introduces bold lines as partial resummations of bare dQMC, so that the total number of diagrams can be reduced, and the sign problem is hence suppressed. This approach is further improved in [11] by writing the evolution of the bold lines as an integro-differential equation, which only requires to sum over "linked" diagrams, so that the computational cost can be further reduced. Even after such reductions, however, as the number of points in the time sequence m increases, the total number of diagrams still grows as a double factorial O((m − 1)!!). The Monte Carlo sampling of these diagrams will again contribute to the stochastic error, when a large m is needed.
One possible approach to reduce the stochastic error is to sum up all the diagrams with the same time sequence using a deterministic method. As a direct summation is prohibitive due to the large number of diagrams, it calls for designing better algorithms to circumvent the difficulty. The fermionic bath influence functional in the bare continuous-time hybridization expansion (CTHYB) [27,28,36,42], which is the counterpart of bare dQMC for bosons, can be calculated in the form of a determinant [27,42] and thus the computational cost can be reduced to O(m 3 ). In the inchworm method, which has less severe numerical sign problem, such a method cannot be directly applied as inchworm expansion only sums over the linked diagrams and thus corresponding bath influence functional cannot be written in a determinant form as in CTHYB.
The recent work [8] tackles this challenge with an inverted algorithm, which takes the idea of [33] that considers the sum of all linked diagrams at once and utilizes the massive cancellations between the diagrams, leading to an exponential rather than factorial computational complexity. It has been shown that the inverted algorithm can asymptotically achieve a computational cost at O(m 3 α m ), which is significantly smaller than the double factorial complexity for the direct summation of all linked diagrams. The work [8] also developed another algorithm based on inclusion-exclusion principle which is even more efficient in the sense that it can further reduce the constant α in the context of inchworm hybridization expansion. The inclusion-exclusion principle describes how the cardinality (or other measures) of unions of sets can be calculated, which is also well known through the Venn diagram. This principle has been applied in a number of areas to reduce the computational complexity, including set partitioning [7], counting perfect matchings [4] and computing matrix permanents [34]. The algorithm in [8] is designed by excluding all the unlinked diagrams from the set of all diagrams, where the set of all the unlinked diagrams is represented by the union of several non-disjoint sets. This allows the inclusion-exclusion principle to be applied to the summation of diagrams, resulting in significant reduction of the computational time.
In this work, we aim to generalize these efforts on fermionic cases to bosonic cases for the simulation of open quantum systems. For bare dQMC, instead of the determinant form, the bath influence functional now holds the form of the matrix hafnian [2,5]. By the inclusion-exclusion principle, we propose a fast algorithm with computational cost O(2 m ), which is efficient for small values of m (around m ≤ 20). We then further generalize the idea to inchworm method for bosonic systems so that the summation of diagrams with the same time sequence also requires only O(α m ) operations. A sharp estimation of α will also be provided for our algorithm.
The rest of this paper proceeds as follows: In Section 2, we introduce the Dyson series and use inclusion-exclusion principle to derive an algorithm which sums over the diagrams in the Dyson series efficiently. In Section 3, another fast algorithm based on inclusion-exclusion principle is designed to sum over the linked diagrams appearing in the inchworm method. An optimization for this algorithm is further proposed, and a complexity analysis is included to examine the computational cost of the optimized algorithm. Section 4 verifies these theoretical results by numerical experiments. Finally, we draw our conclusion in Section 5.

Dyson series with inclusion-exclusion principle
We study an open quantum system described by the von Neumann equation and Id s , Id b are respectively the identity operators for the system and the bath.
We are interested in the evolution of the expectation for a given observable O = O s ⊗ Id b acting only on the system part, defined by O(t) := tr(Oρ(t)) = tr(Oe −itH ρ(0)e itH ). ( Due to the high dimensionality of the space H b , it is usually impractical to solve e ±itH directly. One feasible approach is to apply the method of quantum Monte Carlo to approximate O(t) numerically. Below we will first introduce the bare dQMC based on the Dyson series expansion of O(t) , and then propose an efficient method to compute a key term in the expansion known as the bath influence functional.
2.1. Introduction to the bare diagrammatic quantum Monte Carlo method. Upon assuming the initial density matrix has the separable form ρ(0) = ρ s ⊗ ρ b where the bath ρ b commutes with the Hamiltonian H b , the expectation of observable O(t) can be represented by the following Dyson series (for derivation, see [11]): Here #{s < t} denotes the number of s i which are less than t and tr s takes trace of the system degree of freedom. In practice, one may truncate the series above at a sufficiently largeM and evaluate those high-dimensional integrals on the right-hand side using Monte Carlo integration, resulting in the bare dQMC. This requires us to evaluate the integrand in the Dyson series (3) for each sample. The explicit formula for the propagator U (0) in given in Appendix A, which contains the observable O s and is associated with the system space. As for the bath influence functional L b , we assume that the Wick's theorem can be applied so that where B : {(τ 1 , τ 2 ) | 0 ≤ τ 1 ≤ τ 2 } → C is the two-point bath correlation and the set Q(s 1 , · · · , s m ) is the collection of all possible ordered pairings of the time sequence (s 1 , · · · , s m ): , · · · , m}, j l < k l for any l = 1, · · · , m/2 .
For example, when m = 4, L b (s 1 , s 2 , s 3 , s 4 ) is given by In particular, when m = 0, the value of L b (∅) is defined as 1. With such expression of bath influence functional, the right-hand side of (3) only sums over the terms with even m. We may also express (6) using many-body diagrams: In the diagrammatic representation above, each diagram refers to a product B(·, ·)B(·, ·) where each arc connecting a pair denotes the corresponding two-point correlation.
The major challenge on evaluating L b (s 1 , · · · , s m ) (m is even) is that its diagrammatic representation includes in total (m − 1)!! diagrams, which leads to a double factorial growth in the computational cost on calculating such a bath influence functional via direct method (i.e., direct summation over each diagram in the expansion such as (7)). As this cost increases drastically when m gets larger, one needs to compute a given L b (s 1 , · · · , s m ) using the Monte Carlo method (on top the Monte Carlo sampling of (s 1 , · · · , s m )), leading to larger stochastic error. In this section, we will show how we can benefit from the well-known inclusion-exclusion principle to greatly reduce the complexity of computing the bath influence functional.

2.2.
Inclusion-exclusion principle for computing L b (s 1 , · · · , s m ). Mathematically, the equation (4) is known to be the hafnian of an undirected graph [2]. Several fast algorithms have been introduced to compute such quantity in the recent years [5,6,15,21,24,29]. While most algorithms aim for a good complexity for large values of m, here we are going to introduce a novel fast algorithm for computing hafnians with small m based on the inclusion-exclusion principle.
Given a function µ(·) satisfying the additivity such that where S is a given finite universal set containing A 1 , A 2 , · · · , A n . The inclusion-exclusion principle plays important roles in a number of fast algorithms such as Ryser's algorithm for matrix permanents [34] and the diagrammatic resummation of quantum impurity models [8]. To apply this principle on the evaluation of L b (s 1 , · · · , s m ), we set S as the collection of all combinations of m/2 distinct pairs from s 1 , · · · , s m and A i as all combinations of m/2 distinct pairs from s 1 , · · · , s m except s i : for any j = 1, · · · , m/2 , x j = s j 1 , y j = s j 2 ∈ {s 1 , · · · , s i−1 , s i+1 , · · · , s m } with j 1 < j 2 for any j = 1, · · · , m/2 .
We point out that Each diagram in the braces refers to a pairing in S or A i whose first and second components are represented by blue and red arcs respectively. For instance, we have ((s 1 , s 2 ), (s 1 , s 2 )) = , ((s 1 , s 2 ), (s 1 , s 3 )) = , · · · One can observe that s i (the point marked in green) never occurs in A i . Regardless of the blue/red color of arcs, all diagrams of A i are excluded from Q(s 1 , s 2 , s 3 , s 4 ), which is the collection of the diagrams in the diagrammatic representation of L b (s 1 , s 2 , s 3 , s 4 ). In fact, the union of A i contains all diagrams that are not in Q(s 1 , · · · , s m ), and therefore S\ m i=1 A i is the set formed by arranging all the pairings in Q(s 1 , · · · , s m ) in all possible orders. Denote a set function µ(·) as µ(E) := ((x1,y1),··· ,(x m/2 ,y m/2 ))∈E B(x 1 , y 1 ) · · · B(x m/2 , y m/2 ) for E = S, A 1 , · · · , A m the left-hand side of inclusion-exclusion principle (8) is then given by Here the combinatorial factor (m/2)! takes into account the possible permutations of pairs in ((s j 1 , s k 1 ), · · · , (s j m/2 , s k m/2 )) which all refer to the same element in the set.
On the other hand, the terms in the right-hand side of (8) are for l = 1, · · · , m − 2 and 1 ≤ k 1 < · · · < k l ≤ m. (10) Note that the intersection A k 1 ∩· · ·∩A k l does not include s k 1 , · · · , s k l and thus the i, j indices in the subscript of the summation can never be assigned as k 1 , · · · , k l . In addition, A k 1 ∩ · · · ∩ A k l is empty for l = m − 1, m, and therefore the corresponding values of µ(·) is zero.
The equation (11) can be used to calculate L b (s 1 , · · · , s m ) for given values of B(s i , s j ). Again in the example of m = 4, this equation can be expanded as It can be checked by direct calculation that the cancellations among these two-point correlations will finally lead to the same result as (6) from the definition of L b (s 1 , · · · , s m ), which justifies the equivalence between the two approaches to evaluate the bath influence functional. The formula (12) does not seem to hold any advantage at first glance. Indeed, for small values of m, the formula (11) requires more operations than the direct approach using the definition (4). However, the computational cost of such inclusion-exclusion principle will become significantly cheaper when m is large, which is m 0 Here the binomial coefficient m n is the number of terms in the summation in (11) with respect to k 1 , · · · , k n , and the binomial coefficient m−n 2 corresponds to the number of terms in the definition of Q k 1 ···kn . This cost grows considerably slower than the double factorial for the direct calculation of the bath influence functional. We note that the reduction of computational complexity can be compared to the Ryser formula [34] for computing the permanent of an m×m matrix, which is also derived from the inclusion-exclusion principle with the computational cost also being O(m 2 2 m ). In our case, we are able to further reduce the computational cost to O(2 m ) by calculating Q k 1 k 2 ···kn iteratively. Specifically, we define the symmetrization of the two-point correlation asB Below we useB ∈ C m×m to denote the symmetric matrix whose entries areB(s i , s j ). Based on such definition, the desired bath influence functional is then computed by the hafnian [2] of the symmetric matrixB, while Q and each Q k 1 k 2 ···kn are expressed in terms of the summation over entries ofB: Note that the constant 1 2 is needed here since we have taken into each term twice in the symmetrized summation. From this definition, we can observe that Q k 1 k 2 ···kn is a half of the sum over all the entries ofB excluding the k i th rows and columns for i = 1, · · · , n. Thus the value of Q k 1 k 2 ···kn can be obtained from the Q k 1 k 2 ···k n−1 by taking away the k n th row and column. We are then inspired to define which describes the summation over ith row and column except the k 1 th, k 2 th, · · · , k n−1 th entries. The underlined terms are subtracted since they do not exist in the sum Q k 1 k 2 ···k n−1 , and the diagonal entryB(s i , s i ) is counted twice but this does not matter since it equals zero by definition. Now we can compute Q k 1 k 2 ···kn by Note that this relation also holds for n = 1, for which the left-hand side of (14) becomes R (i) , denoting the sum of the ith row of the matrixB. The equation (15) reduces the computational cost of each Q k 1 k 2 ···kn to O(1) once the initial value Q is given, and each R (kn) k 1 k 2 ···k n−1 can also be obtained iteratively by only one subtraction: which can be easily seen according to its definition. For a more intuitive understanding, one may refer to Figure 1 to visualize the procedures to compute a simple example as Q 24 : Each node is assigned with the value of the corresponding entry of the matrixB (with the coefficient 1 2 ), so Q is simply the summation over all such nodes. We can reach to the desired Q 24 by the following two steps: (i) calculate R (2) (summation over the red lines) and obtain Q 2 (summation over all nodes that are not on red lines) using relation (15); (summation over the blue lines excluding the two boxed nodes) using relation (16) and get Q 24 (summation over all nodes that are on neither red nor blue lines) again by relation (15).
Note that the nodes on the green diagonal are all equal to zero, which explains why the double counting on the nodes of intersection on red/blue lines will not affect the result of the calculation as we have mentioned previously.
To end this section, we examine the computational cost of such procedures which are written in details as Algorithm 1: The major complexity concentrates in the calculation of R (i) k 1 k 2 ···kn and Q k 1 k 2 ···kn in Line 8 and 9, both of which require 1 subtraction in each iteration for the total 2 m iterations, and the evaluation of the final step in Line 13 whose cost is again 2 m . Other computations such as the initial settings for R (i) and Q need at most m 2 operations and thus are minor. Consequently, Algorithm 1 has the complexity at O(2 m ), which is much cheaper compared to the original (13). Algorithm 1 Inclusion-exclusion principle for computing L b (s 1 , · · · , s m )

5:
fork from 1 to i − 1 do 6: for n from 0 to max(min(m − 4,k − 1), 0) do 7: for 1 ≤ k 1 < k 2 < · · · < k n <k do 8: end for 13: end for 14: Compute L b (s 1 , · · · , s m ) according to (11) ⊲ Final step Compared with previous works on the computation of hafnians, this algorithm does not have the optimal time complexity. In [5], Björklund et al. have proposed an algorithm that computes the hafnian of a m × m matrix with time complexity O(m 3 2 m/2 ), which requires computation of all the eigenvalues of 2 m/2 matrices. Asymptotically, such an algorithm is faster than our algorithm for large m. However, according to our experiments, Algorithm 1 is not slower than Björklund's algorithm up to m = 22 due to a relatively smaller prefactor of the overall complexity, which is sufficiently efficient as a satisfactory convergence of Dyson series generally will not require a very large m. We have attached our MATLAB code for Algorithm 1 in Appendix B and one may compare the time efficiency of Björklund's algorithm with ours.

Inchworm Monte Carlo method with inclusion-exclusion principle
The bare dQMC method can be only applied to short-time simulation, since the variance of the integrand in the Monte Carlo method grows exponentially with simulation time, which is known as the dynamical sign problem [27,28,35]. One approach to alleviate the sign problem is the inchworm Monte Carlo method proposed in [12], which introduces the full propagator G(s i , s f ) defined by (see [11] for a derivation) for the initial time point s i ∈ [0, 2t]\{t} and the final time point s f ∈ [s i , 2t]\{t}. One may compare the definition of such a full propagator with the desired expectation of observable (3) to find the relation O(t) = tr s (ρ s G(0, 2t)), suggesting that we should obtain O(t) by studying the evolution of G(s i , s f ).
In [11,Section 4], an integro-differential equation formulation for the full propagator is proposed as Here we recall that W s is the perturbation associated with the system, and the functional U is defined similarly to U (0) in (17) with the bare propagator G (0) (·, ·) replaced by the full propagator G(·, ·). Its formula together with some important properties of the full propagator are summarized in Appendix A.
The definition of L c b is similar to (4): where Q c denotes the set of linked pairings: By saying q is "linked" in the diagrammatic representation, we mean that all points in a diagram are connected with each other using arcs as "bridges". In the same example (7) as when m = 4, the second diagram on the right-hand side is considered to be linked since one may start from any of the four points and reach to any other one going through the path formed by the union of the arcs. More rigorously, this linkedness is defined as follows.
Definition 1 (Linked pairs). Two pairs of real numbers (s 1 , s 2 ) and (τ 1 , τ 2 ) satisfying s 1 ≤ s 2 and τ 1 ≤ τ 2 are linked if either of the following two statements holds: Definition 2 (Linked sets of pairs). Given two sets of pairs q 1 and q 2 , we say q 1 and q 2 are linked if there exists (s 1 , s 2 ) ∈ q 1 and (τ 1 , τ 2 ) ∈ q 2 such that (s 1 , s 2 ) and (τ 1 , τ 2 ) are linked. We say a given set of pairs q is linked if it cannot be decomposed into the union of two sets of pairs that are not linked.
When two sets of pairs q 1 and q 2 are linked, we also say that q 1 is linked to q 2 and vice versa.
For example, the first diagram on the right-hand side of (7) is not linked since it can be decomposed into q 1 ∪ q 2 where q 1 = {(s 1 , s 2 )} and q 2 = {(s 3 , s 4 )} and obviously q 1 is not linked to q 2 . For the same reason, the third diagram is not linked either. Therefore, only the second diagram is linked and which is diagrammatically expressed as = , where L c b is denoted by a rounded box covering time sequence (s 1 , · · · , s m ). Another example for this many-body diagrammatic representation when m = 6 is given below: We observe that the rounded box above only contains four linked diagrams, while the corresponding bath influence functional L b (s 1 , s 2 , s 3 , s 4 , s 5 , s 6 ) is the summation of all possible ordered pairings including diagrams like which are not linked. Compared with the Dyson series, the advantage of the equation (18) is that its series with respect to m has a faster convergence than that in (3), leading to a less severe numerical sign problem. Also, the number of diagrams in L c b (s 1 , · · · , s m ) grows asymptotically as e −1 (m − 1)!! [37], which is less than the number of diagrams in L b (s 1 , · · · , s m ). To solve (19) numerically, one can apply Runge-Kutta type methods to discretize the time s f , and the integrals on the right-hand side of (18) are approximated by the Monte Carlo method, especially for large m. Thus, it can be expected that most of the computational time is spent on the evaluation of L c b (s 1 , · · · , s m ) defined in (19), and hence, a fast algorithm for L c b (s 1 , · · · , s m ) is desirable. Below, we are going to combine the result in Section 2.1 and the technique developed in [8] to accelerate the computation of L c b for large m.
3.1. Inclusion-exclusion principle for computing rounded box L c b (s 1 , · · · , s m ). Since any rounded box covering two points can be directly evaluated by the corresponding two-point correlation, we only consider the calculation of L c b (s 1 , · · · , s m ) with even m ≥ 4 in the rest of of this section. In the inclusion-exclusion principle (8), we set S = Q * (s 1 , · · · , s m ) := {(s j 1 , s k 1 ), · · · , (s j m/2 , s k m/2 )} {j 1 , · · · , j m/2 , k 1 , · · · , k m/2 } = {1, · · · , m}, k l − j l ≥ 4 for any l = 1, · · · , m/2 (23) equipped with the set function In (23), Q * (s 1 , · · · , s m ) is similarly defined as Q(s 1 , · · · , s m ) in (5) but does not include any pair formed by two adjacent time points, i.e., we do not consider any pair (s j l , s k l ) with k l − j l = 2 in S. This difference in definition is based on the fact that an ordered pairing with an arc connecting two adjacent points such as the ones in (22) will never be linked, and thus is not considered in a rounded box. Note that S = ∅ when m = 2, which explains why we restrict our discussion for m ≥ 4 in this section. Upon further introducing Diagrammatically, we express a given L * b (s 1 , · · · , s m ) by a rectangular box. For example when m = 6, we have Similar as the bath influence functional L b (s 1 , · · · , s m ), the rectangular box above contains all linked diagrams (the first four diagrams on the right-hand side). However, L * b (s 1 , · · · , s m ) has only one unlinked diagram (the last diagram) and does not include any unlinked diagrams with adjacent pairs. Such idea has also been applied in [8] referred as "second optimization" to eliminate some unlinked diagrams that will not be used in inchworm method. By writing L * b (s 1 , · · · , s m ) as the last formula of (24), we can again apply Algorithm 1 to compute a given m−point rectangular box at the complexity of O(2 m ) upon setting all entries on the subdiagonal and superdiagonal of the matrix in Figure 1 to be zero.
To continue the inclusion-exclusion principle, we further let given V := (s i+1 , · · · , s i+2n ) being a subsequence of (s 1 , · · · , s m−1 ), where we have used the short-hand notation Q * (V ) to denote Q * (s i+1 , · · · , s i+2n ). By saying q V is a linked component of q, we mean q V ⊂ q is linked but not linked to q\q V . Each A V is a collection of some unlinked diagrams since each of its elements q can be decomposed as All linked q V (marked in red) in the corresponding A V are eventually included in the rounded boxes which are calculated as L c b (V ). The rest of points in {s 1 , · · · , s 10 } are not linked to q V and they build up all possible ordered pairings without any pair of adjacent two points. Therefore, we group these points in the rectangular boxes and compute them by L * b ({s 1 , · · · , s 10 }\V ). Note that some rectangular boxes may be divided by rounded boxes into several nonadjacent segments and we use "thin pumps" to connect these segments above the rounded boxes to indicate that the points covered by these segments are in the same L * b . Consequently, the diagrams on the right-hand side above are expressed as the following formulas: The union of A V contains all the diagrams in S which are not linked. Note that in (29), we have neglected the diagram with 8−point rounded box µ(A {s 1 ,··· ,s 8 } ), which equals zero as its rectangular box is formed by the two adjacent points and thus L * b (s 9 , s 10 ) = 0. Moreover, in the definition of A V in (23), we do not need to consider the case where V includes the last time point s m . This is because, given any V ′ = {s m−2n+1 , · · · , s m }, if we define A V ′ as the set of diagrams with a linked component including all the last 2n points, then each element in A V ′ can be found in at least one of the A V 's defined in (23). For example, the five diagrams of A {s 7 ,s 8 ,s 9 ,s 10 } given by On the right-hand side of (8), if the sets V 1 , · · · , V l are mutually disjoint, we have if any V i and V j contain a common point s i , then A V 1 ∩ · · · ∩ A V l = ∅, which has no contribution in the inclusion-exclusion principle. Thus, in the example of m = 10, the following intersections provide nonzero contribution: We have again neglected the intersection µ A {s 1 ,s 2 ,s 3 ,s 4 } ∩ A {s 5 ,s 6 ,s 7 ,s 8 } which contains the null-valued rectangular box L * b (s 9 , s 10 ). Now we insert (27)- (29) and (33) into (8) to get the diagrammatic representation for L c b (s 1 , · · · , s 10 ) using inclusion-exclusion principle as At this moment, we have obtained an indirect approach based on inclusion-exclusion principle to evaluate a given rounded box L c b (s 1 , · · · , s m ): One may first expand a rounded box as (34) and then compute the (bridged) rectangular boxes using formula (11) (or more efficiently Algorithm 1). Each rounded box on the right-hand side with length greater than 2 can again be evaluated by the same procedure. In the subsequent section, we will further optimize the computational complexity of the calculation by an improved algorithm.
3.2. Improved algorithm. The main idea of our improved algorithm is to combine the diagrams with the same rectangular boxes. Specifically, in the example (34), the first two diagrams in the last line have the same rectangular box, and they can be written together by the distributive law as [L c b (s 2 , s 3 , s 4 , s 5 )L c b (s 6 , s 7 , s 8 , s 9 ) − L c b (s 2 , · · · , s 9 )]L * b (s 1 , s 10 ). For simplicity of notations, we define the dotted box: In general, a dotted box with even time points represents the sum of all partitions of these time points by rounded boxes with length greater than or equal to 4, and the sign of each term depends on the number of rounded boxes (even for + and odd for −).
Two more examples are given below: With the first two diagrams in the last line of (34) replaced by (35), the number of diagrams to be summed over can be reduced by 1 and we now compute L c b (s 1 , · · · , s 10 ) as = (zero "•"s in ) + + · · · · · · + (four "•"s in ) + + · · · · · · + (six "•"s in ) + + (eight "•"s in ) (36) For rounded boxes with more points, more diagrams can be reduced. If the complexity of evaluating each dotted box is not more expensive than a rounded box with the same length, a considerable reduction in total computational cost using this optimization can then be expected for a large m. In the following subsection, we consider an efficient approach to compute the dotted boxes.

Iterative method for computing dotted boxes.
For any even integer m ≥ 4, we denote each dotted box by L d b (s 1 , · · · , s m ), which is calculated in formulas as m points , · · · , s m ). (37) Note that the multiple summation in the second line above becomes one single rounded box L c b (s 1 , · · · , s m ) when the index j = 0.
For a more efficient implementation, we compute a given dotted box iteratively based on the previous results of shorter dotted boxes. For example when m = 12, we have = − where one can see that the computation requires only 3 multiplications and 3 subtractions.
More generally, such iteration is described by the lemma below: Proof. We consider the following resummation of the original definition (37) by restricting the "length" of the last rounded box in the multiple summation, i.e., the term L c b (s i j +1 , · · · , s m ): By decreasing the index j in each term by 1, one may easily check these multiple summations will coincide the definition (37) for shorter dotted boxes, i.e., , which proves (39).
By comparing the number of diagrams that need to be summed up for a rounded box in the example (34) with that for a dotted box in (38), one can easily see that computing a dotted box for a large m using the above iterative method is even cheaper than computing a rounded box of the same size. Later in Section 3.3, we will carry out a complexity analysis on the computational cost of these dotted boxes in the entire algorithm. Now we are ready to formulate the expansion of rounded boxes L c b (s 1 , · · · , s m ) for an arbitrary even m using only dotted and rectangular boxes, and propose an optimized algorithm to compute the rounded boxes.

3.2.2.
Inclusion-exclusion principle with optimization for computing rounded boxes. The example in the previous section suggests that a rounded box is computed by the summation of all possible diagrams filled up by the nonadjacent dotted boxes covering at least four points excluding the right end and a rectangular box covering rest of the time points. The formula is provided in the following theorem: Theorem 2. Given the increasing time sequence (s 1 , · · · , s m ) with m ≥ 4 being an even number, we have where k = ⌊ m 5 ⌋. The "rest of points" denotes all time points in (s 1 , · · · , s m ) which do not occur in the brackets of any L d b in the same summand. We remark that on the right-hand side of (40), the number of dotted boxes L d b (· · · ) in each summand does not exceed k = ⌊ m 5 ⌋ since all dotted boxes are pairwise nonadjacent and each of them includes at least four points. For example, the diagrams or are not allowed. Now we arrive at an optimized algorithm based on inclusion-exclusion principle to calculate a rounded box L c b (s 1 , · · · , s m ): One writes the rounded box as the expansion (36) using Theorem 2 and then apply Algorithm 1 to calculate the rectangular part and Lemma 1 for the dotted segments. To avoid repeated calculations caused by recursion, one should compute all rounded segments from short to long until the entire rounded box is obtained. Such procedures in general are described by Algorithm 2.
We have now finished the implementation of inclusion-exclusion principle for computing the functional L c b (s 1 , · · · , s m ). Similar as computing the bath influence functional, inclusionexclusion principle for the rounded box is less efficient than the direct summation of all linked diagrams for a small m. However, our complexity analysis in the next section will show that the new algorithm will outperform the direct method as m becomes large. The central idea is that, when m increases, the number of diagrams in (36) will grow significantly slower than double factorial (the growth rate of the number of diagrams in the direct method).
The proposed algorithm in this section can be regarded as the bosonic version of the algorithm introduced in [8] for fermions. We have also further improved the algorithm by a more efficient scheme to compute the dotted boxes (Section 3.2.1).  Compute L c b (s k , · · · , s k+2n−1 ) according to (40) where each L * b (rest of points) is computed according to Algorithm 1 8:

Complexity analysis.
In this section, we will show that the computational complexity of Algorithm 2 based on inclusion-exclusion principle is significantly smaller than double factorial, which is the growth rate of the direct summation over all linked diagrams. We denote the complexities of evaluating 2n−point rounded and dotted segment respectively by C rd (2n) and C dt (2n). The total computational cost for L c b (s 1 , · · · , s m ) using Algorithm 2 can be immediately written down as where the last term above refers to the cost of the final step in Line 11 of Algorithm 2. Based on the previous calculations on the shorter segments, the computational complexity of a dotted box C dt (2n) is simply given by according Lemma 1. Therefore, we focus on the estimation of the complexity of rounded segments C rd (2n). Inspired by the example (36), the computational cost for L c b (s k , · · · , s k+2n−1 ) in Line 7 using Theorem 2 can be estimated as In the estimation above, a 2n,2k is the number of diagrams where the total length of the dotted boxes is 2k. C b (2n − 2k) is the computational cost of a rectangular box with length 2n − 2k, which is at O(2 2n−2k ) as we have discussed at the end of Section 2.2. "k" and "1" respectively counts the multiplications used among the rectangular part and dotted boxes within each diagram, and the addition between every two diagrams. For example, the last two terms in the right-hand side of (36) read + whose computational cost contributes to the k = 4 term in the summation (43) for C rd (10).
In each of the a 10,8 = 2 diagrams above, we compute a two-point rectangular box and need at most two multiplications (for the second diagram). We further claim that a 2n,2 = 0 since a two-point dotted box including two adjacent points always has zero value, and thus there will not exist any diagram containing a two-point dotted segment in inclusion-exclusion expansion of a given rounded box. At this point, we only need to focus on the estimation for a p,2k (p can be odd), which essentially is the number of nonadjacent partitions over the integers from 1 to p − 1 (the last point is excluded), where each dotted segment covers at least four points and the total length of all dotted segments is 2k. The following statement provides a useful recurrence relation for the sequence {a p,2k }: Lemma 2. Given integers p ≥ 1 and 0 ≤ k ≤ ⌊ p−1 2 ⌋, the sequence {a p,2k } satisfies the recurrence relation a p,2k = a p−1,2k + (a p−5,2k−4 + a p−7,2k−6 + a p−9,2k−8 + · · · + a p−(2k−3),4 ) + 1.
where all diagrams have the same length p. Since all the sets on the right-hand side are disjoint (which can be observed by focusing on the last dotted box of the diagrams), we immediately get (44) by counting the number of diagrams on both sides. To show (47), we can take any diagram on the left-hand side, and check the status of the last second point: • If the last second point is not included in any dotted boxes, then the diagram must belong to the first set on the right-hand side of (47); • If the last second point is included in a 2n-point dotted box (1 < n < k − 1), then the diagram must belong to the nth set on the right-hand side of (47); • If the last second point is included in a 2k-point dotted box, then this diagram must be the one in the last line of (47).
It is now clear that the left-hand side of (47) is a subset of its right-hand side. For the reverse direction, the right-hand side is obviously a subset of the left-hand side since any diagram in any set on the right-hand side of (47) has in total 2k points in dotted boxes, and the last point is never included into any boxes. This completes the proof of the lemma.
By (48), we see that which shows that in the Maclaurin expansion of f (x, 2), the coefficient of x 2n equals the the first summation in the second line of (50). According to (49), the function f (x, 2) is a rational function, so that its Maclaurin expansion can be found via the Heaviside cover-up method [40]: 4461i and x 4,5 ≈ 0.7348 ± 0.62649i are the poles of f (x, 2) and c i are some constants. Therefore, asymptotically we have Similarly, the second summation in the last line of (50) is the coefficient of x 2n in the Maclaurin expansion of f (x, 1) and we can deduce that n k=0 a 2n,2n−2k ∼ O (1.44327 2n ). Hence, C rd (2n) 2.12577 2n + n · 1.44327 2n ∼ O(α n ) with α ≈ 4.51891.
Afterwards, we insert the above upper bound back into (41) to obtain the overall complexity C opt (m) of Algorithm 2: The fraction in the last line above is asymptotically O(α m/2−1 ) and thus the second term dominates the upper bound. Such estimation indicates that the major computational cost of the algorithm is spent on the rectangular boxes in the final step (Line 11) when calculating the longest rounded box. To summarize the analysis, we state the conclusion in the theorem below: Theorem 3. Given the increasing time sequence (s 1 , · · · , s m ) with m being an even number, the complexity of Algorithm 2 computing the entire rounded box L c b (s 1 , · · · , s m ) can be bounded by C opt (m) α m/2 with α ≈ 4.51891.
Compared to the direct method whose computational cost grows as fast as double factorial in m, the inclusion-exclusion principle based algorithm with exponential growth rate is obviously more efficient when m is large.

Numerical experiments
In this section, we will first numerically verify the statements on the complexities of algorithms, and then simulate both bare dQMC and inchworm Monte Carlo method to see how we can benefit from the inclusion-exclusion principle in applications.
We consider the spin-boson model [19,23,41] where the Hamiltonian and perturbation operators associated to the system are H s = ǫσ z + ∆σ x , W s =σ z whereσ x andσ z are the usual Pauli matriceŝ The observable of interest is set to be O =σ z ⊗ Id b , which meets the condition that O only acts on the system space. The initial density matrix ρ = ρ s ⊗ ρ b is given by where Z is a normalizing factor chosen such that tr(ρ b ) = 1. Assume a Ohmic spectral density, the two-point correlation function is formulated as where ∆τ is the time difference on the Keldysh contour defined as and the coupling intensity c l and frequency of each harmonic oscillator ω l are given by In our experiments, we will study two examples with the parameter settings listed below in Table 1. As one can observe from Figure 2, B(τ 1 , τ 2 ) under the two parameter settings both decay to zero for large time difference, which guarantees the convergence of the Dyson series as well as the infinite series in the inchworm integro-differential equation (18). In Case 2, the decay of |B(τ 1 , τ 2 )| is slower than Case 1, leading to a slower convergence of the Dyson series and the inchworm series. It can then be expected that larger m needs to be included in the simulation of Case 2. In this section, we compare the wall clock time on evaluating given L b (s 1 , · · · , s m ) and L c b (s 1 , · · · , s m ) using direct summation and Algorithms 1 and 2 based on the inclusion-exclusion principle. The experiments are carried out using MATLAB on Intel Xeon CPU X5650 and the results for the time consumed may vary for different hardware, programming languages and implementation details. Since the operation counts do not depend on the value of B(τ 1 , τ 2 ), we will only use the parameters for Case 1 in our test throughout this section.  The computational time for a various choice of m is plotted in Figure 3. In the left panel, we compare direct method with Algorithm 1 for computing a given bath influence functional. As predicted, the efficiencies of the two algorithms are comparable for small order m. Starting from m = 12, however, due to the double factorial growth in complexity, the time cost for the direct method becomes obviously larger than the inclusion-exclusion principle whose growth rate is only exponential as O(2 m ) according to our discussion at the end of Section 2. The right panel of Figure 3 compares Algorithm 2 with the direct summation over the linked diagrams to compute a rounded box. Algorithm 2 outperforms the direct method when m ≥ 16. The dotted line represents our estimation of the growth rate O(α m/2 ). We can observe that the curve of the inclusion-exclusion principle gradually becomes parallel to the dotted line, as indicates that our estimation of the computational complexity is sharp.
We would also like to discuss the memory cost of the algorithms. The direct method is out of memory for our machine once entering the red region (i.e., m > 20) and thus the results are not presented. In our simulation, to implement the direct method efficiently, we first generate all the linked diagrams and store their configurations in the memory, so that the computational time presented in Figure 3 can be minimized. To store the diagrams, we use a matrix of size A m × m to record the indices of the time sequence in all linked diagrams. Here A m represents the number of diagrams, and m denotes the length of the rounded box. For example, all the diagrams included in L c b (s 1 , s 2 , s 3 , s 4 , s 5 , s 6 ) (see (21)) are stored in the following 4 × 6 matrix where each row of the matrix describes the pairing of time points (arcs) in one diagram on the right-hand side of (21). However, the size of this matrix grows quickly as m increases due to the double factorial growth of the number of diagrams. For example, when the length of a rounded box reaches m = 22, the size of matrix turns out to be 4342263000×22. Even if we use the uint8 data type in MATLAB (the smallest unsigned integer type that takes only one byte) to store the matrix, the total memory cost is around 89G, which is beyond the capacity of most machines. A workaround is to further compress the matrix using more compact storage patterns, or generate the diagrams during the summation. Both approaches will cause additional operations so that the computation may be further slowed down. As a comparison, the major memory cost for inclusion-exclusion principle concentrates in the temporary storage of complex-valued Q k 1 k 2 ···ks and R k 1 k 2 ···ks appearing in Theorem 1, which grows only as an exponential and the memory requirement is at most 16× 2 m+1 1000 3 G (only 0.1344G for m = 22). Here 16 refers to the number of bytes for a double-precision complex number, and 2 m+1 refers to the total number of entries in Q k 1 k 2 ···ks and R k 1 k 2 ···ks . As a result, a longer diagram can be computed using the algorithm based on the inclusion-exclusion principle. Such a memory issue for the direct method also exists when computing L b (s 1 , · · · , s m ) since the number of diagrams given by a bath influence functional is even larger than that in a rounded box of the same size.

4.2.
Numerical simulations for open quantum systems. We now implement several numerical simulations on the observable σ z (t) using bare dQMC and inchworm Monte Carlo method respectively, in which inclusion-exclusion principle will be used to evaluate large rectangular and rounded boxes. We would like to check if and how frequently we will encounter the scenarios when the order m has to be chosen very large during the simulations to show the necessity of exploiting the inclusion-exclusion principle.

Numerical methods.
We first introduce the numerical methods for our simulations. In particular, we will discuss the details of the implementation of bare dQMC. For the more complicated inchworm equation, we only provide our Monte Carlo sampling method, which is novel in this work. One may refer to [10,11] for the general framework of the full implementation.
In the Dyson series (3), m should be chosen as even due to the Wick's theorem for the bath influence functional. Moreover, the m = 0 term in the Dyson series does not contain any time points and thus no Monte Carlo sampling is needed when applying bare dQMC. Therefore, we may take out this term and rewrite (3) as To approximate the infinite series in the above formula using Monte Carlo integration, we need sample • a positive even number m; • a sequence of times: 0 ≤ s 1 ≤ s 2 ≤ · · · ≤ s m ≤ 2t.
Once m is chosen, the time sequence (s 1 , s 2 , · · · , s m ) can be generated by drawing a sample from the uniform distribution U ([0, 2t] m ) and then sorting the sequence. In our previous works [10,11], instead of sampling the even number m, we simply truncated the series (3) at m =M and use the same number of samples for each m. In the current paper, we would propose a heuristic approach to take samples of m. Ideally, the probability of m should be proportional to the absolute value of the integral in (3). Since such a function is not available, we make the following approximations: • Ignore the term tr s (· · · ) representing the system part; • Use the uniform distribution of s 1 , · · · , s m to represent the value of L b (s 1 , · · · , s m ) in all cases.
Thus the distribution of m becomes where τ = 2t 2M +1 and λ 0 is given such that the normalization Mmax M =1 P t (m = 2M ) = 1 holds. Here we set M max to be the maximum value of M in order to prevent m from being too large, which may cause unnecessary huge computational cost in the evaluation of the bath influence functional. Thereafter, the bare dQMC approximates the observable σ z (t) as where N s is the number of samples, and the quantities with superscript (j) denote the jth sample.
In order to study the evolution of the observable σ z (t) in the time interval [0, T ], we will need to compute all σ z (nh) for n = 1, 2, · · · , T /h given the time step h (the initial value is σ z (0) = tr(σ z ρ s ) = 1 according to the definition (2)). Therefore, we need to first generate P t for all t = h, 2h, · · · , T , which requires the calculation of long L b including up to 2M max time points. This can be time-consuming when the time step h is small, and it is also unnecessary for short time simulations where a large m (j) is unlikely to be sampled. To improve the efficiency of simulations, we consider a more accessible distribution to approximate P t . In (52), we insert the definition of the bath influence functional and reach to where the constant B is some average of the two-point correlation. Inspired by this formulation, we set B to be a constant and choose the probability mass function to beP t (m = 2M ) = λ −1 1 (2Bt 2 ) M /M !, where λ 1 is chosen such that the normalization Mmax M =1P t (m = 2M ) = 1 holds. Thus each sample m (j) can be drawn based on the Poisson distribution. More precisely, we sample m according to Note that the "−1" is needed on the left-hand side above since a standard Poisson distribution samples nonnegative integers from 0 while M begins with 1. It remains only to set a suitable value for B ∈ (0, max |B|). Here we simply select B such that the probability mass function of m is close to (52) for t = T . For example in Figure 4, one can compare the three probability mass functions of Poisson distributions with different B for the numerical example Case 1 to see that the yellow dashed-dotted line gives a satisfactory approximation to P t . Poisson distributions with some other B are plotted as references, which are comparatively far away from the target blue line. Therefore, we set B = 0.2 for the Poisson distribution in Case 1. To end this section, we provide a brief discussion on the key procedures of the sampling method in the implementation of the inchworm Monte Carlo method. In (18), the partial derivative ∂/∂s f on the left-hand side is discretized by a certain time integrator such as Heun's method. As for the right-hand side, similar to the bare dQMC, we need to approximate the infinite series by sampling an even number m and the time sequence (s 1 , · · · , s m−1 ) at every time step. The time sequence is again sampled according to the uniform distribution U ([s i , s f ] m ), and the probability mass function of m is analogous to (52): where τ = t M . To avoid the expensive computations of the long rounded boxes L c b , we also sample each m (j) applying the Poisson distribution (54) in practice, where the choice of B is subject to a satisfactory approximation to the distribution (55), which is set to be B = 0.2 and B = 0.3 for Case 1 and Case 2, respectively. We refer the readers to Figure 5 for a comparison between the Poisson distribution and the distribution (55) for t = T .

Numerical results.
With the numerical methods introduced, we are now ready to present the results of our simulations on the time evolution of the observable σ z (t) . We are particularly interested in the convergence of σ z (t) computed by both bare dQMC and inchworm method w.r.t. the order m. Specifically, we first perform the simulation with the series in (17) or (18) truncated at m =M , and plot the real part of σ z (t) up to T = 2.5. Note that due to the numerical error, the computed σ z (t) may contain a nonzero imaginary part. We hope to observe the convergence of these results to the numerical solution using our approach introduced in Section 4.2.1, which justifies our numerical method. In our simulation, the Poisson distribution is truncated at M max = 13, so that the maximum value of m is 26.  Figure 6 plots the numerical results for parameter setting Case 1 using bare dQMC. We set the time step to be h = 0.1 and compute σ z (nh) for each n = 1, · · · , 25. For the result with m sampled by Poisson distribution, each σ z (nh) is calculated based on N s = 10 8 Monte Carlo samples. As for the results with fixed truncationM , we evaluate each m-dimensional integral in (3) using N s = 2 × 10 7 Monte Carlo samples. One can observe that the curve of observable tends to converge asM grows. However, significant difference can still be observed between the results forM = 6 andM = 8, indicating that larger m needs to be taken into account to get  Figure 6. Evolution of Re σ z (t) for parameter setting Case 1 by bare dQMC reliable results, and thus considering m also as a random variable turns out to be an efficient way to find suitable number of samples. With this approach, larger m will be encountered in the simulation, and we have listed in Table 2 the number of large m (within the red region of Figure 3) sampled by the Poisson distribution, which also represents the number of m-point bath influence functionals evaluated in the entire simulation. For example, we need to compute 1820 independent L b s (j) 1 , · · · , s (j) 26 for Monte Carlo integration using inclusion-exclusion principle. The evaluation of such high-order bath influence functionals is hardly feasible using the direct method.  As for the simulations by inchworm Monte Carlo method, we refer to Figure 7 for the numerical results with both parameter settings Case 1 and Case 2. The time step is again set as h = 0.1, while the number of samples is chosen as a relatively smaller N s = 10 5 (N s denotes the total number of samples used in the simulation by Poisson distribution, and the number of samples used for each (m − 1)-dimensional integral in the simulations with fixed truncation) since the numerical error of inchworm method is generally smaller than that of classic Dyson series [10]. For Case 1, the curve with fixedM becomes almost identical toM = 6 thanks to the rapid convergence of inchworm method. The result by bare dQMC using Poisson distribution (same as the solid line in Figure 6) is given as a reference. This indicates that inchworm Monte Carlo method can provide a satisfactory approximation to the exact solution with a small truncationM and hence outperform bare dQMC for this set of parameters. As no larger m is needed, there is no need to apply the inclusion-exclusion principle in this case. However in Case 2, the inchworm Monte Carlo method also suffers from slow convergence due to the slow decay of the two-point correlation function (see Figure 2). The right panel of Figure 7 shows that the discrepancy betweenM = 6 andM = 8 is still noticeable. With the adaptive choice of m, we are able to obtain results in good agreement with the reference results provide by bare dQMC with 10 8 samples for each σ z (nh) . Again, we list in Table 3 the number of samples involving large m in this experiment to show that inclusion-exclusion principle is indispensable to the calculation of long rounded boxes that direct method cannot deal with.

Conclusion
We have proposed fast algorithms based on inclusion-exclusion principle to sum diagrams appearing in the bare dQMC and inchworm Monte Carlo method. For bare dQMC, we have developed a formula to efficiently evaluate the bosonic bath influence functional at the cost of O(2 m ). Note that in the fermionic case, the bath influence functional becomes a determinant [27,42], while in the bosonic case, the computational cost is higher, but it turns out that the computational complexity is lower than the Ryser's algorithm for matrix permanents. For the inchworm method, our algorithm calculating the sum over linked diagrams can be considered as an extension to the work [8] which deals with the fermionic quantum impurity models. By a detailed complexity analysis, we have proved that the new algorithm reduces the computational cost from the original double factorial to exponential. More precisely, we estimate the computational complexity as O(α m/2 ) where α ≈ 4.51891, which has been also verified by our numerical experiments. Moreover, numerical simulations for the spin-boson model have been implemented to show the advantages of our approaches. A.2. Definition of U . The functional U in the integro-differential equation (18) is given by U (s i , s 1 , · · · , s m−1 , s f ) = G(s m−1 , s f )W s G(s m−2 , s m−1 )W s · · · W s G(s 1 , s 2 )W s G(s i , s 1 ), where the full propagator G(s i , s f ) is defined by , if s i < t s f with the propagator associated with the bath if s i s f < t, Below we provide our MATLAB code to compute the hafnian of a symmetric matrix B. The input matrix needs to be a symmetric square matrix with all diagonal entries being zero.