Sparse Modeling in Quantum Many-Body Problems

This review paper describes the basic concept and technical details of sparse modeling and its applications to quantum many-body problems. Sparse modeling refers to methodologies for finding a small number of relevant parameters that well explain a given dataset. This concept is reminiscent of physics, where the goal is to find a small number of physical laws hidden behind complicated phenomena. Sparse modeling extends the target of physics from natural phenomena to data, and may be interpreted as "physics for data". The first half of this review introduces sparse modeling for physicists. It is assumed that readers have a physics background but no expertise in data science. The second half reviews applications. The Matsubara Green's function, which plays a central role in descriptions of correlated systems, has been found to be sparse, i.e., to contain little information. This leads to (i) a new method for solving the ill-conditioned inverse problem of analytical continuation, and (ii) a highly compact representation of the Matsubara Green's function, which enables efficient calculations for quantum many-body systems.


Introduction
A small number of physical laws exist behind apparently complicated behaviors: this is the basic notion of physics. In condensed matter physics, we expect the existence of simple laws that approximately explain the behavior of ensembles in some parameter region. In this context, finding simple laws means finding an effective model that well explains the behavior of physical quantities with simple, hopefully mean-field-level, calculations.
In the field of data science, one of the main goals is to find features that discriminate different kinds of data efficiently. The use of fewer features makes it easier to understand what is happening. Moreover, a small set of features is more flexible for describing a wide range of data than a large set of features designed to fit a specific dataset. Sparse modeling offers methodologies for this purpose. Solving an optimization problem, one can find essential parameters (cause) from a complicated dataset (effect) just as physicists find relevant physical laws from complicated natural phenomena.
Technically, sparse modeling treats inverse problems. Consider a well-defined mapping rule x ↦ y, where y is known. The inverse problem is to derive x for a given y. In practical situations, however, this inverse transformation is often difficult to perform because the observed data y may be disturbed by noise or the inversion may, in principle, not be uniquely defined. A standard strategy for inferring a seemingly correct solution relies on prior knowledge, which compensates for the lack of information for the inversion.
Which prior knowledge leads to a plausible solution? The maximum entropy method (MEM) uses the prior knowledge that x should be close to an "ideal solution" that carries all the desired features. On the other hand, sparse modeling takes advantage of sparsity; that is, it assumes that the solution x has only a small number of non-zero components. Therefore, y is fitted with a small number of components in x even if agreement with y is sacrificed.
A brilliant success of the sparsity criterion has been demonstrated in the applications to MRI. [1][2][3][4][5] With this criterion, incomplete signals measured in the Fourier domain can be stably transformed into a real-space image that looks as if complete signals had been used. A recent observation of a shadow of a supermassive black hole relied on data analysis that included sparse modeling. [6][7][8][9][10][11] These successful applications hint at even wider applicability of sparse modeling beyond data analysis of measurements.
In the second half of this paper, we present applications of sparse modeling to quantum many-body problems. The first question from a sparse-modeling point of view is: what is sparse in quantum many-body theory? Recent investigations have shown that the information handled in the imaginary-time framework is sparse. More precisely, the information that the imaginary-time (Matsubara) Green's function G(τ) can carry is quite limited. Analytically, the imaginary-time representation G(τ) and the real-frequency representation G^R(ω) are equivalent, in the sense that the transformation from one to the other preserves information. In practice, however, G(τ) is more fragile against noise, meaning that a numerically computed G(τ) has lost a large part of the information on G^R(ω). In this sense, G(τ) is sparse.
Given that the numerically computed G(τ) contains limited information on real-frequency dynamics, we now turn our attention to how to extract the relevant information. This can be achieved using a new basis set, which is placed as an intermediate representation (IR) between the imaginary and real frequencies. With the IR basis, the sparse-modeling technique leads to a new algorithm for the conversion (analytical continuation) from G(τ) to G^R(ω), and to efficient computations in quantum many-body theories.
The rest of this review paper is organized as follows. In Sect. 2, the fundamentals of sparse modeling are presented. Section 3 focuses on the technical details of sparse modeling; readers not interested in numerical calculations may skip this section. Selected applications are reviewed in Sect. 4, with particular focus on condensed matter physics research. The applications to quantum many-body problems are presented in Sects. 5-7. Section 5 focuses on the "sparsity" of the Matsubara Green's function and introduces a proper basis. This basis is utilized for a new method of analytical continuation in Sect. 6 and for efficient calculations of many-body problems in Sect. 7. Finally, this review is closed with a summary in Sect. 8.

Inverse problem of underdetermined systems
In this section, we present the concept of sparse modeling, which opens a new paradigm for solving inverse problems. Starting from a simple problem, we try to unveil the essence of sparse modeling. To this end, we consider a simple inverse problem, namely, a linear equation of the form

y = Ax, (1)

where x and y are vectors of dimensions N and M, respectively, and A is an M × N matrix. Let us suppose that we want to obtain x from the known variables y and A.
Obviously, the equation can be solved immediately if M = N and the inverse of A exists:

x = A^{-1} y. (2)

However, if M < N, the number of equations is insufficient for x to be determined uniquely. We consider such systems, called underdetermined systems, in the rest of this section. Even for underdetermined systems, there are cases where the equation can be solved exactly. Here, "sparsity" plays an essential role. The vector x is called sparse if most of its components are zero. Let n be the number of non-zero components of x. If we can find the positions of the zeros in x and remove these zeros from the set of equations, the number of unknown components is reduced from N to n. Thus, equations that belong to underdetermined systems are solvable if M > n.

Methods for finding sparse solutions
The problem now is whether and how to find the positions of the non-zero components in x. One of the simplest methods is L0-norm minimization. The L0 norm, denoted by ||x||_0, counts the number of non-zero components of x. By selecting, from the set of solutions of the underdetermined system, the solution that minimizes ||x||_0, the sparsest solution is obtained. This statement is formulated as

min_x ||x||_0 subject to y = Ax. (3)

However, L0-norm minimization is a combinatorial optimization problem, whose computational cost grows exponentially with N. Therefore, an alternative formulation is required from a practical point of view. A feasible approach that handles the sparsity requirement at a manageable computational cost is to relax the norm from L0 to L1, which leads to

min_x ||x||_1 subject to y = Ax, (4)

where ||x||_1 denotes the L1 norm defined by

||x||_1 = Σ_k |x_k|. (5)

This problem can be solved with moderate computational complexity using the interior point method, a standard technique in convex optimization. The solution of Eq. (4) is sparse, as shown below. Let us consider the simplest case, N = 2 and M = 1: one linear equation is given for two unknown variables x_1 and x_2. The set of solutions forms a straight line in the x_1-x_2 plane, as shown in Fig. 1. The L1 norm is given by ||x||_1 = |x_1| + |x_2|, whose contour is a rhombus. The optimization problem in Eq. (4) is now interpreted as follows: the rhombus must intersect the straight line, and its size should be as small as possible. The solution that satisfies this statement is, in the case of Fig. 1, located on the x_1-axis because of the cuspidal shape of the L1-norm contour. Thus, Eq. (4) yields a unique solution in which some components tend to be zero, that is, a sparse solution. Sparsity and non-exponential computational cost are therefore compatible in L1-norm minimization.
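As a concrete illustration, the constrained L1 problem in Eq. (4) can be recast as a linear program and handed to an off-the-shelf solver. The sketch below is a minimal example with a made-up system (not from the text); it uses SciPy's `linprog` and the standard split x = u − v with u, v ≥ 0, so that ||x||_1 = Σ(u + v):

```python
# Minimal sketch: min ||x||_1 subject to y = Ax, solved as a linear program.
import numpy as np
from scipy.optimize import linprog

def l1_min(A, y):
    """Minimum-L1-norm solution of the underdetermined system y = Ax."""
    M, N = A.shape
    c = np.ones(2 * N)                  # objective: sum(u) + sum(v) = ||x||_1
    A_eq = np.hstack([A, -A])           # constraint: A(u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * N))
    return res.x[:N] - res.x[N:]        # x = u - v

# Hypothetical N = 2, M = 1 system, analogous to the geometry of Fig. 1:
A = np.array([[1.0, 2.0]])
y = np.array([2.0])
x = l1_min(A, y)
```

For this line x_1 + 2x_2 = 2, the minimum-L1 point sits on a coordinate axis, illustrating how the cusps of the L1 ball produce exact zeros.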

Conventional method: L 2 -norm minimization
Traditionally, the L2 norm has been used to determine a solution of underdetermined systems. The L2 norm is also referred to as the Euclidean norm, that is, the "ordinary" norm defined by

||x||_2 = (Σ_k x_k^2)^{1/2}. (6)

L2-norm minimization is thus written as

min_x ||x||_2 subject to y = Ax. (7)

The solution of Eq. (7) corresponds to the intersection between the straight line and the circle, as shown in Fig. 2. We note that both x_1 and x_2 are finite, unlike the case of L1-norm minimization. The advantage of L2-norm minimization is that the solution can be evaluated analytically. Using the Lagrange multiplier method, Eq. (7) is rewritten as

min_x [ ||x||_2^2 + ν^T (y − Ax) ], (8)

where ν is the vector of Lagrange multipliers. Taking the derivatives and assuming the underdetermined condition M < N, we obtain the solution x* of this optimization problem as

x* = A^+ y, (9)

where A^+ = A^T (AA^T)^{-1} is called the Moore-Penrose pseudoinverse matrix. 12)
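Equation (9) is straightforward to evaluate numerically. A minimal sketch with a hypothetical one-equation system:

```python
# Minimum-L2-norm solution via the Moore-Penrose pseudoinverse,
# x* = A^T (A A^T)^{-1} y, for a full-row-rank underdetermined A.
import numpy as np

def l2_min(A, y):
    return A.T @ np.linalg.solve(A @ A.T, y)

A = np.array([[1.0, 2.0]])   # made-up M = 1, N = 2 system
y = np.array([2.0])
x = l2_min(A, y)             # weight is spread over both components
```

The same result is obtained from `np.linalg.pinv(A) @ y`; note that, in contrast to the L1 solution, both components are finite.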

Finding the true solution
As mentioned, a unique solution can be determined for underdetermined systems if an additional condition is provided. It is then important to judge whether the obtained solution is reasonable. The L1-norm-minimized solution is sparse. Therefore, if the true solution is also sparse, there is a chance that the L1-norm-minimized solution coincides with the true solution. It has been proven, for a specific model, that L1-norm minimization indeed yields the exact solution. [13][14][15][16] In contrast, an L2-norm-minimized solution does not have this feature. For underdetermined problems, the L2-norm-minimized solution exactly coincides with the true solution only when N = M, and thus the additional condition does not help in finding the true solution. One can only avoid obviously unreasonable results that involve infinitely large components.
We illustrate the difference between L1- and L2-norm minimizations by considering a random matrix A_rand as an example. 17) We suppose that the true solution x_0 has only n = 20 non-zero components among N = 1000 [Fig. 3(a)]. This vector is converted to y = A_rand x_0, which has M = 100 components [Fig. 3(b)]. This problem satisfies the condition n < M < N, meaning that the equations are underdetermined but solvable if the zero components are properly eliminated. Figures 4(a) and 4(b) show x reconstructed using L1-norm minimization [Eq. (4)] and L2-norm minimization [Eq. (7)], respectively. As shown, L1-norm minimization perfectly recovers the true solution x_0, whereas L2-norm minimization fails to do so because the weights are distributed among all components.
The property demonstrated above has been proven analytically using integral geometry 13,14) and information statistics. 15,16) Let us define α ≡ M/N (the ratio between the dimensions of y and x) and ρ ≡ n/N (the ratio of non-zero components in x to all components), and consider the continuous limit N → ∞. Then, there exists a critical value α_c above which the exact solution is reconstructable. As shown in Fig. 5, L1-norm regularization yields a finite region where α_c < 1 is satisfied, while for L2-norm regularization α_c = 1, meaning that the exact solution can be obtained only when the complete information of y is available.

Handling noise in inverse problem
We have so far assumed that x satisfies the equation exactly. In practical situations, however, the input to the equation, y, contains noise, and the equation need not be satisfied rigorously. This situation is represented as

y = A x_0 + ε, (10)

where x_0 is the true solution and ε is an M-dimensional noise vector. Assuming that ε follows a Gaussian distribution with zero mean, the maximum likelihood estimation for x corresponds to the minimization of the function

F_0(x) = (1/2) ||y − Ax||_2^2, (11)

which is called the minimum mean square error estimation. We use the notation F in analogy with the free energy in statistical physics; the subscript 0 indicates that no extra term is introduced besides the quadratic term. The minimum point of F_0(x) can be expressed analytically as

x* = (A^T A)^{-1} A^T y. (12)

However, under the underdetermined condition M < N, this solution suffers from numerical instability (division by zero) because the N × N matrix A^T A is rank deficient. A well-defined solution can be obtained by adding an extra term to F_0(x) in Eq. (11). This approach is called regularization, and the additional term is called a regularizer. If we adopt the L2 term as a regularizer, we obtain

F(x) = (1/2) ||y − Ax||_2^2 + (λ/2) ||x||_2^2, (13)

where λ is a small constant. This regularization is referred to as Ridge regression. The L2 term replaces A^T A with (A^T A + λI) in Eq. (12) to yield

x* = (A^T A + λI)^{-1} A^T y. (14)

Here, I denotes the unit matrix. Because of λ, the inverse always exists and the solution is well defined. However, the L2 term tends to make the solution featureless, as shown in Fig. 4, and thus coincidence with the true solution is not expected.
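The Ridge solution of Eq. (14) is a one-liner in practice. A hedged sketch with a made-up underdetermined system:

```python
import numpy as np

def ridge(A, y, lam):
    """Closed-form Ridge solution x* = (A^T A + lam*I)^{-1} A^T y."""
    N = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(N), A.T @ y)

# Hypothetical underdetermined system: M = 5 equations, N = 20 unknowns.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 20))
y = rng.standard_normal(5)
x = ridge(A, y, lam=1e-3)   # finite even though A^T A is rank deficient
```

For small λ the result approaches the minimum-norm solution `np.linalg.pinv(A) @ y`, while the added λI keeps the linear solve well conditioned.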
It is natural to replace the L2 term with an L1 term in Eq. (13) to select out a sparse solution. The function F(x) to be minimized is then

F(x) = (1/2) ||y − Ax||_2^2 + λ ||x||_1. (15)

Here, λ is a parameter that controls the sparsity. The minimization problem in the form of Eq. (15) is called the least absolute shrinkage and selection operator (LASSO). 18) The LASSO is a convex optimization problem, which guarantees convergence of the iterative update procedure to the unique solution.
The parameter λ is often called a hyperparameter to distinguish it from the ordinary parameters that specify a model. The value of λ should be determined so that the effect of regularization is moderate. The inverse problem is unstable if λ is too small, and the result becomes artificial if λ is too large. Techniques for automatically fixing the value of λ are presented in Sect. 3.4.

Maximum entropy method
As mentioned, an additional condition (regularization) helps to determine a unique solution of underdetermined equations. The MEM is one such method. 19) As a regularizer, the MEM employs a distance from an "ideal" solution called the default model m. The default model is prepared so that it fulfills all prior knowledge about the expected solution. The obtained solution is thus, by construction, not far from m.
The function F(x) to be minimized in the MEM is given by 19)

F(x) = (1/2) ||y − Ax||_2^2 − λ S(m, x), (16)

where S(m, x) is defined as

S(m, x) = Σ_k [ x_k − m_k − x_k ln(x_k / m_k) ]. (17)

Here, S(m, x) quantifies a "distance" between m and x, and is referred to as the information entropy or Kullback-Leibler divergence. 20) Let us consider the effect of the regularizer S(m, x), as done for the L1- and L2-norm regularizations in Figs. 1 and 2. Figure 6 shows contour lines of S(m, x) in the x_1-x_2 plane. As noted earlier, the solution is given by the intersection between the straight line, which represents the set of solutions of y = Ax, and one of the contour lines. In this view, the solution exhibits no particular characteristics except that it is close to m. There is no reason to expect coincidence between the MEM solution and the true solution, because the former depends on the choice of the default model m.
The MEM has been applied in various fields. Examples include the calculation of spectral functions, which we discuss in Sect. 6, and astronomical image analysis. 21,22) Practically, the ambiguity due to the default model may be less severe in the case of astronomy because a typical default model is available from the observed data. We have to keep in mind, however, that the default model should be simple enough to avoid biased analysis.

Is the true solution really sparse?
The success of L1-norm regularization strongly relies on the sparsity of the true solution. One may think that the true solution in realistic problems is not always sparse and that L1-norm regularization therefore would not work in these cases. We emphasize that sparsity is basis dependent. In other words, we can bring out the potential sparsity of the data by transforming the basis.
Let us suppose that x represents an image. Then, the spatial variation of x is expected to be smooth over extended regions and abrupt at the edges of objects. This contrast indicates a possible sparsity of the data in the representation Dx, where D is a matrix that takes the difference between adjacent elements of x, namely, (Dx)_k = x_{k+1} − x_k in one dimension.
We can generalize the LASSO defined in Eq. (15) so that a basis transformation of x is taken into account. The general form of the LASSO is thus

F(x) = (1/2) ||y − Ax||_2^2 + λ ||Bx||_1, (15′)

where B is an M′ × N matrix, which transforms x into a sparse representation x′ ≡ Bx of dimension M′. It is crucial to choose the basis transformation B properly when applying the LASSO to a problem. In Sect. 5, we present an example in which the use of the LASSO leads to a new basis that compactly represents x.
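To make the basis dependence of sparsity concrete, the following sketch (with a hypothetical piecewise-constant signal, not from the text) builds the one-dimensional difference operator D and shows that Dx is far sparser than x itself:

```python
import numpy as np

N = 100
# Made-up piecewise-constant "image": smooth regions, two abrupt edges.
x = np.concatenate([np.zeros(40), np.ones(35), 0.5 * np.ones(25)])

# First-order difference operator: (Dx)_k = x_{k+1} - x_k
D = -np.eye(N - 1, N) + np.eye(N - 1, N, k=1)

print(np.count_nonzero(x), np.count_nonzero(D @ x))   # 60 2
```

The signal itself has 60 non-zero components, but Dx has only 2 (one per edge), so the L1 penalty is applied to Dx rather than to x.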

Relation to machine learning
We close this section by discussing the relation between sparse modeling and machine learning. In applications of machine learning, we are interested in the result for prediction, classification, etc., but not in how machines predict. Optimized parameters inside machines are typically not analyzed or are difficult to analyze because of the huge number of parameters and complexity of nonlinear transformations. In sparse modeling, in contrast, we are interested in the processes used for prediction. For this aim, it is crucial to find relevant parameters (or descriptors) among all parameters. The sparsity criterion plays a central role in this task.
Although the main concepts of sparse modeling and machine learning are different, these methods share some technical details. L 1 -norm regularization is utilized in a learning process to remove redundant parameters, which cause overfitting and thus make prediction unreliable. As the number of retained parameters is reduced, it becomes possible to examine the optimized parameter set and determine which parameters mostly control the prediction.
Finally, we will mention a direct application of sparse modeling to machine learning. We have so far assumed that the matrix A is given in Eq. (1) and a sparse solution for x is pursued. There is another class of optimization problems that searches for A as well as x for a given dataset fy i g. These problems, called dictionary learning, are presented in more detail in Sect. 4.4.2.

Algorithms for Solving Inverse Problems
The previous section demonstrated that the sparsity constraint can be implemented with L1-norm regularization. In this section, we discuss how to solve a minimization problem that includes the L1-norm term. Two classes of methods are introduced, namely, the iterative shrinkage-thresholding algorithm (ISTA) in Sect. 3.2 and the alternating direction method of multipliers (ADMM) in Sect. 3.3. Then, Sect. 3.4 describes methods for determining an optimal value of λ, which controls the sparsity of the solution.
In this section, we consider how to solve the optimization problem

x* = arg min_x F(x), (18)

where arg min_x is an operator that returns the x that minimizes the operand. We suppose that the target function F(x) has the general form

F(x) = f(x) + g(x), (19)

where f(x) and g(x) are differentiable and non-differentiable functions, respectively. For the LASSO in Eq. (15), f(x) and g(x) are given by

f(x) = (1/2) ||y − Ax||_2^2, (20)
g(x) = λ ||x||_1. (21)

Recall that the dimensions of x and y are N and M, respectively, and that the underdetermined condition N > M is assumed.

Soft threshold function
It is instructive to begin with the one-dimensional case, N = M = 1, to see the effect of the L1-norm term. The LASSO in Eq. (15) is then rewritten as

F(x) = (1/2)(y − x)^2 + λ|x|. (22)

Minimization of this function can be performed by considering x > 0 and x < 0 separately. The solution is called the soft threshold function S_λ(y), and is given by

S_λ(y) = y − λ (y > λ), 0 (−λ ≤ y ≤ λ), y + λ (y < −λ). (23)

Figure 7 compares the solution x = S_λ(y) with x = y, i.e., the solution without the L1 term (λ = 0). It turns out that, for a given y, the absolute value is reduced by λ, with zero as a lower limit. A small input is thus discarded and a sparse solution is generated.

Method I: ISTA
Gradient descent is a simple and fundamental method for finding the solution of inverse problems. However, because the derivative of g(x) = λ||x||_1 is discontinuous, gradient descent cannot be naively applied to the LASSO. In the following, starting from gradient descent, we consider how to treat the non-differentiable function g(x) and derive an alternative update formula that is applicable to the LASSO.
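The soft threshold function S_λ(y) of Eq. (23) is one line of NumPy; the sketch below (input values chosen arbitrarily) shows how inputs smaller than λ in magnitude are discarded:

```python
import numpy as np

def soft_threshold(y, lam):
    """Element-wise soft threshold: sign(y) * max(|y| - lam, 0), Eq. (23)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
x = soft_threshold(y, 1.0)   # magnitudes shrink by 1; small inputs become 0
```

This element-wise operation is the building block of both ISTA and ADMM below.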

Majorization-minimization (MM)
In gradient descent, the vector x is updated iteratively using the derivative of f(x) as

x_{t+1} = x_t − η ∇f(x_t), (25)

where η is a small quantity. The exact solution is obtained as long as f(x) is a convex differentiable function in the region considered. This update formula can be expressed as x_{t+1} = arg min_x f_{1/η}(x; x_t), with

f_{1/η}(x; x_t) = f(x_t) + ∇f(x_t)^T (x − x_t) + (1/(2η)) ||x − x_t||_2^2, (26)

which is called the majorizer of the function f(x). The majorizer approximates f(x) around x = x_t with a quadratic function, as in a Taylor expansion, but the quadratic term is simplified to an isotropic form [with no dependence on the curvature of f(x) at x_t].

The most important feature of the majorizer is that, if f(x) is sufficiently smooth around x_t and if η is sufficiently small, 23) f_{1/η}(x; x_t) satisfies the following inequality with respect to the original function f(x): 24)

f(x) ≤ f_{1/η}(x; x_t). (27)

The equality is satisfied at x = x_t; in particular, when ∇f(x_t) = 0, the temporary solution x_t has reached the exact solution.

This inequality indicates that one may use the majorizer f_{1/η}(x; x_t) instead of f(x) to find the minimum of f(x): gradient descent can be regarded as a successive minimization of f_{1/η}(x; x_t). We now exploit the inequality (27) to establish an algorithm for minimization problems that include a non-differentiable function. Adding g(x) to both sides of Eq. (27), we obtain

F(x) ≤ F_{1/η}(x; x_t), (28)
F_{1/η}(x; x_t) ≡ f_{1/η}(x; x_t) + g(x). (29)

This majorizer defines an alternative minimization problem that does not include Ax and hence is solvable, in contrast to the original F(x). The solution attained by minimizing F_{1/η}(x; x_t) is expressed as

x_{t+1} = arg min_x [ g(x) + (1/(2η)) ||x − v_t||_2^2 ], (30)

where

v_t = x_t − η ∇f(x_t). (31)

The update formula, Eq. (30), is very generic and can be used even if g(x) is not a differentiable function. This method is called the majorization-minimization (MM) algorithm.

Application of MM to LASSO
Let us apply the MM algorithm to the LASSO. The update formula is obtained by substituting g(x) in Eq. (30) with Eq. (21). The minimization can then be performed for each element separately, because the elements are independent and each has the form of Eq. (22). The solution is hence the soft threshold function defined in Eq. (23):

[x_{t+1}]_k = S_{ηλ}([v_t]_k), (32)

where [·]_k denotes the k-th element of the vector. We represent this equation simply as

x_{t+1} = S_{ηλ}(v_t), (33)

where S_{ηλ}(v) is regarded as an element-wise soft threshold function. Inserting Eq. (20) into the definition of v_t yields the explicit expression

v_t = x_t + η A^T (y − A x_t). (34)

We set η = 1/||A^T A||_2 to satisfy the condition of the majorizer. 25) The update in Eq. (33) is repeated until x_t converges. This iterative algorithm based on the MM method is called ISTA. Based on Nesterov's acceleration, 26) a faster version of ISTA (called FISTA) has been proposed. 27)
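Putting the pieces together, a minimal ISTA sketch for the LASSO follows. The recovery experiment at the end is illustrative (a made-up random A and 5-sparse x_0, not the example from the text), and the (1/2)||y − Ax||² convention of Eq. (20) is assumed:

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ista(A, y, lam, n_iter=5000):
    """ISTA for the LASSO: min_x (1/2)||y - Ax||_2^2 + lam*||x||_1."""
    eta = 1.0 / np.linalg.norm(A.T @ A, 2)   # step size 1/||A^T A||_2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        v = x + eta * (A.T @ (y - A @ x))    # gradient step on f(x)
        x = soft_threshold(v, eta * lam)     # shrinkage step on g(x)
    return x

# Hypothetical noiseless recovery test: 5-sparse x0, M = 50, N = 200.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200))
x0 = np.zeros(200)
x0[rng.choice(200, size=5, replace=False)] = 1.0
x = ista(A, A @ x0, lam=0.1)   # recovers x0 up to a small lam-dependent bias
```

Each iteration costs only two matrix-vector products, which is why ISTA scales well; FISTA differs only by an extrapolation step between updates.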

Method II: ADMM
We now describe the ADMM, a flexible method whose modern formulation is reviewed by Boyd et al. 28) As discussed in Sect. 2.7, the choice of basis is crucial in applications of the LASSO. Hence, we need to consider an L1 term of the form g(x) = λ||Bx||_1, where B is a transformation matrix to a sparse basis. For this case, ISTA does not work well without introducing complications, 29) but the ADMM does. 30) The only possible difficulty of the ADMM is the required computation of the inverse of an N × N matrix (shown later). If the inverse matrix can be obtained, the ADMM should be the first choice in practice.

Augmented Lagrange multiplier method
We first represent the minimization problem min_x F(x) of the function in Eq. (19) as

min_{x,z} [ f(x) + g(z) ] subject to h(x, z) ≡ Bx − z = 0. (35)

The single minimization problem is thus split into two problems coupled by an additional constraint. The minimization is performed for x and z separately, and the constraint is then imposed gradually. This treatment leads to fast convergence and allows other constraints to be handled flexibly.
Normally, a constraint is treated using the Lagrange multiplier method. However, this method is not a good choice in the present situation, because the Lagrange multiplier term enforces the constraint rigorously at each iteration step, so that the two minimization problems remain strongly coupled. To relax the constraint, we apply the augmented Lagrange multiplier method 31) and reformulate Eq. (35) as

F̃(x, z, ν) = f(x) + g(z) + ν^T h(x, z) + (μ/2) ||h(x, z)||_2^2, (36)

where ν is the Lagrange multiplier and μ is the coefficient of the penalty term ||h(x, z)||_2^2. The penalty term pushes the solution toward the constraint h(x, z) = 0 rather gradually. Although the penalty term itself does not enforce the constraint rigorously, the Lagrange multiplier term enforces it in the converged solution. The augmented Lagrange multiplier method thus combines the advantages of the two methods: fast convergence and a rigorous implementation of the constraint. Minimization of F̃(x, z, ν) with respect to x, z, and ν is performed iteratively for a fixed value of μ. The vectors x and z are updated by solving the individual minimization problems at each iteration step:

x_new = arg min_x F̃(x, z, ν), (37)
z_new = arg min_z F̃(x_new, z, ν). (38)

Explicit solutions depend on the forms of f(x) and g(z). The multiplier is updated using the rule 31)

ν_new = ν + μ h(x_new, z_new). (39)

The updates in Eqs. (37)-(39) are repeated until convergence is reached.

Application of ADMM to LASSO
Let us apply the discussion above to the LASSO. Replacing f(x) and g(x) with Eqs. (20) and (21), respectively, we rewrite Eqs. (37) and (38) as

x_new = arg min_x [ (1/2)||y − Ax||_2^2 + (μ/2)||Bx − z + u||_2^2 ], (40)
z_new = arg min_z [ λ||z||_1 + (μ/2)||Bx_new − z + u||_2^2 ]. (41)

Here, we changed the variable to u = ν/μ to simplify the notation. The minimization of the quadratic form in Eq. (40) can be done analytically to yield

x_new = (A^T A + μ B^T B)^{-1} [ A^T y + μ B^T (z − u) ]. (42)

The minimization problem in Eq. (41) is independent among vector elements. Therefore, the solution in the one-dimensional case applies to each element, and z_new is given by

z_new = S_{λ/μ}(B x_new + u), (43)

where S_{λ/μ}(B x_new + u) is the element-wise soft threshold function defined in Eq. (23). The update of u is obtained from Eq. (39) as

u_new = u + B x_new − z_new. (44)

The computation procedure is summarized as follows. For given λ and μ, we begin with the initial vectors x = 0, z = 0, and u = 0. The updates, Eqs. (42)-(44), are repeated until convergence is reached. We note that, in Eq. (42), the inversion of a matrix of size N × N is performed once before the iteration starts; the updates then involve only matrix-vector products.

How to fix the hyperparameter λ
The optimization problem in the LASSO [Eq. (19)] includes the hyperparameter λ, and its solution x*(λ) thus depends on λ. We therefore need to fix the value of λ to select the best solution among x*(λ). We introduce two methods that are often employed in the literature. For concreteness, we again consider the explicit problem with the random matrix A_rand introduced in Sect. 2.4. 17) The true solution x_0 has only n = 20 non-zero components out of N = 1000, as depicted in Fig. 3(a). The input y with dimension M = 100 is now disturbed by Gaussian noise according to Eq. (10). Figure 8 shows a comparison between y and y_0 ≡ A x_0.

"Elbow" method
Let us begin by observing the influence of λ on the solution x*(λ). To this end, we introduce a score s(λ) that quantifies the extent to which the original equation, y = Ax, is satisfied. The coefficient of determination, R², is used as the score, which in the present case is defined as

s(λ) = 1 − ||y − A x*(λ)||_2^2 / Σ_i (y_i − ⟨y⟩)^2, (45)

where ⟨y⟩ denotes the mean value of y. In this definition, the squared error ||y − Ax*(λ)||_2^2 is normalized by the variance of y so that scores can be compared between different datasets y. The score s(λ) reaches a maximum of 1 when the equation y = Ax is satisfied exactly; it decreases (and can even become negative) as the deviation between y and Ax*(λ) increases. Figure 9(a) shows the numerical result for s(λ) in the random matrix problem. This graph represents the typical behavior of s(λ). When λ is sufficiently small, the score reaches a maximum value of s(λ) ≈ 1, because fitting the equation y = Ax takes priority over the reduction of bases by L1-norm regularization. This high score, however, does not indicate agreement between the obtained solution x*(λ) and the true solution x_0; it instead implies that x*(λ) is strongly affected by the noise (overfitting). Figure 9(b) shows that the number of non-zero components, n*, in x*(λ) is much larger than the actual number n = 20. The value of s(λ) is insensitive to variations of λ as long as n* stays far above n = 20, namely, in the region λ ≲ 10^{-1}. Once λ exceeds this region and n* falls below n = 20, s(λ) drops drastically below 0.5, where the original equation y = Ax is no longer respected. For λ ≳ 2 × 10^1, no component is retained, i.e., n* = 0 (underfitting). From this behavior, we expect a reasonable solution around the "elbow" (bending point) of s(λ), where the two effects, fitting to the original equation and L1-norm regularization, compete. We choose λ_Elbow ≈ 0.2 by rotating the graph by 45 degrees and finding the maximum.
This simple strategy has been utilized in clustering analysis 32) and interaction network analysis. 33,34) The λ-dependent solution x*(λ) is shown for three values of λ in Fig. 10. The optimal solution x*(λ_Elbow) in Fig. 10(b) turns out to hit the correct non-zero components, although their absolute values deviate non-negligibly from the exact values. Recall that an exact recovery of x_0, as discussed in Sect. 2.4, is possible only in the absence of noise. When λ is too small (large), the solution x*(λ) contains too many (few) non-zero components and is clearly inconsistent with the exact solution x_0.
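The behavior behind the elbow heuristic can be reproduced in a few lines. The sketch below is a scaled-down, hypothetical version of the random-matrix problem (solved with a small ISTA loop rather than the solver used in the text, and assuming the (1/2)||y − Ax||² convention); it evaluates the score of Eq. (45) and the number of retained components over a λ grid:

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ista(A, y, lam, n_iter=3000):
    eta = 1.0 / np.linalg.norm(A.T @ A, 2)
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x + eta * (A.T @ (y - A @ x)), eta * lam)
    return x

def score(A, y, x):
    """Coefficient of determination, s = 1 - ||y - Ax||^2 / sum((y - <y>)^2)."""
    return 1.0 - np.sum((y - A @ x) ** 2) / np.sum((y - y.mean()) ** 2)

# Made-up noisy problem: M = 50, N = 200, 5-sparse truth.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 200))
x0 = np.zeros(200)
x0[rng.choice(200, size=5, replace=False)] = 1.0
y = A @ x0 + 0.1 * rng.standard_normal(50)

for lam in [0.01, 0.1, 1.0, 10.0]:
    x = ista(A, y, lam)
    print(lam, round(score(A, y, x), 3), np.count_nonzero(x))
```

As λ grows, the score decreases while the number of non-zero components drops, tracing out the same competition that produces the elbow in Fig. 9.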

Cross-validation method
An alternative, but more sophisticated estimation can be carried out using the cross-validation (CV) method in statistics. 35) We first describe the concept of the CV method and then discuss practical implementations.
The input dataset y is divided into two subsets, namely, the training dataset y_T and the validation dataset y_V. An optimization problem is set up with y_T, with the LASSO function in Eq. (15) rewritten as

F(x) = (1/2) ||P_T (y − Ax)||_2^2 + λ ||x||_1, (46)

where P_T is a projection operator onto the subspace defined by y_T. After the optimization problem is solved, the solution x*(λ) is validated against y_V by evaluating the score

s_CV(λ) = 1 − ||P_V (y − A x*(λ))||_2^2 / Σ_{i∈V} (y_i − ⟨y⟩_V)^2, (47)

where the operator P_V = 1 − P_T projects Ax*(λ) onto the subspace associated with y_V, and ⟨y⟩_V denotes the mean over the validation dataset. The important point here is that the training (fitting) and the validation are performed with different datasets. Because of this, s_CV(λ) does not approach 1 even in the limit λ → 0, in contrast to s(λ) in Eq. (45). In the presence of noise, a solution that perfectly fits a specific dataset does not fit a different dataset, resulting in a reduction of s_CV(λ) for λ → 0. The highest score is achieved when the input y_T is fitted moderately, avoiding the influence of noise. In practice, the division into y_T and y_V should be repeated to obtain a reliable estimate of s_CV(λ). K-fold CV can be used to systematically generate multiple divisions of the dataset. We split the input vector y into K groups; one of them is assigned as y_V and kept for validation, and the other K − 1 groups are assigned as y_T. Training and validation are performed K times, i.e., for all possible configurations, and the final score is obtained by averaging the K estimates of s_CV(λ). Figure 11(a) shows s_CV(λ) computed by a 5-fold CV calculation. The maximum of s_CV(λ) yields the optimal value λ_CV ≈ 0.03, which is smaller than the previous estimate λ_Elbow ≈ 0.2 by one order of magnitude. The solution x*(λ_CV) is plotted in Fig. 11(b). Its non-zero components are fitted better than those of x*(λ_Elbow) in Fig. 10(b), while at the same time more redundant components remain finite in x*(λ_CV), though their absolute values are small. Which method yields the better estimate depends on the problem.
Nevertheless, the CV method is useful for removing the ambiguity due to hyperparameters without bias. The error bars in Fig. 11 can be reduced by increasing the division number K. In particular, CV in the limit K = M (the dimension of y) is referred to as leave-one-out CV (LOO-CV). An LOO-CV calculation produces a rather smooth curve of s_CV(λ), but has a large computational cost because the LASSO problem must be solved M times. To mitigate this problem, Obuchi and Kabashima 36) derived approximate formulas for LOO-CV. The formulas can be evaluated easily using the solution x*(λ) computed once for the full dataset y, meaning that no repeated computation is necessary for different sets {y_V, y_T}. For details, we refer readers to the original paper, Ref. 36.

Applications and Further Developments
In this section, we introduce applications of sparse modeling based on L1-norm regularization. The first two subsections describe applications to measurement and experimental data analysis. Another class of applications, namely the construction of an effective model from first-principles and experimental data, is presented in Sect. 4.3. Section 4.4 introduces further developments and extended theories related to sparse modeling. Applications to quantum many-body theories are presented in separate sections.

Compressed sensing
The ability of L1-norm regularization to find the true sparse solution has led to innovation in measurement methods. An early work in the 1970s pointed out the potential utility of L1 regularization in the context of geophysics. 37) Theories by Candès et al. 1,2) and Donoho 3) and their application to MRI by Lustig et al. 4,5) have revolutionized measurement. This technology is called compressed sensing (also compressive sensing or compressive sampling). [38][39][40] Let us suppose that there are measurements in which the quantities of interest, x, are connected to measurable quantities, y, through a linear relation y = Ax. A typical example is the Fourier transform. In MRI measurement, for instance, ρ(r) denotes the density distribution of H2O molecules and its Fourier component f(k_i) is measured. Discretizing the volume V into N blocks of equal volume and introducing a vector notation, we obtain the equation y = Ax. In practical situations, y may be incomplete because the measurement might be fundamentally restricted or the number of sampling points is intentionally reduced. In practice, the points k_i are randomly sampled to avoid an artificial structure in the reconstructed data. In all cases, solving the equation y = Ax with respect to x is an underdetermined inverse problem. To express this situation explicitly, we represent the measured data by y′ = Py, where P is a projection operator onto the subset of y with dimension M. The LASSO function in Eq. (15′) then involves a basis transformation B, which is, for example, the finite-difference operator D that transforms MRI images into a sparse representation. If n, the number of relevant components of Bx, is sufficiently small compared with M, L1 regularization reconstructs a clear image from the incomplete data y′. One could further reduce M, and thus achieve a reduction of measurement cost and time.
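A minimal toy of the scheme just described (ours, not from the cited works): a signal with a few spikes is measured through M randomly chosen rows of the unitary DFT matrix, mimicking random k-space sampling in MRI, and is reconstructed from the incomplete data by L1 minimization, here via simple iterative soft thresholding.

```python
import numpy as np

def soft_threshold_c(v, t):
    """Complex-safe soft thresholding: shrink magnitudes by t."""
    mag = np.abs(v)
    return np.where(mag > t, (1.0 - t / np.maximum(mag, 1e-30)) * v, 0.0)

def ista(A, y, lam, n_iter=2000):
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1], dtype=complex)
    for _ in range(n_iter):
        x = soft_threshold_c(x + A.conj().T @ (y - A @ x) / L, lam / L)
    return x

rng = np.random.default_rng(0)
N, M, n = 128, 60, 4                       # N pixels, M Fourier samples, n spikes
x_true = np.zeros(N)
x_true[rng.choice(N, n, replace=False)] = 1.0 + rng.random(n)
F = np.fft.fft(np.eye(N)) / np.sqrt(N)     # unitary DFT matrix
rows = rng.choice(N, M, replace=False)     # random k-points (the projection P)
A = F[rows]
y = A @ x_true                             # incomplete, noiseless measurement

x_rec = ista(A, y, lam=1e-3)
err = np.linalg.norm(x_rec.real - x_true) / np.linalg.norm(x_true)
print("relative reconstruction error: %.2e" % err)
```

Although only 60 of the 128 Fourier components are measured, the sparse signal is recovered accurately; a least-squares solution of the same underdetermined system would not be sparse.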
Compressed sensing can be applied to any experiment that uses the Fourier transform in the analysis procedure. X-ray (neutron) diffraction experiments measure the structure factor f(k_i) as the Fourier transform of the electron (nuclear) density ρ(r). The MEM has typically been used for the inversion, 41,42) but compressed sensing is now an alternative. 43) Compressed sensing has also been applied to NMR spectroscopy for studying molecular dynamics, 44,45) scanning tunneling spectroscopy (STS)/scanning tunneling microscopy (STM) for investigating k-space electronic properties from real-space measurements, 46) inverse X-ray fluorescence holography (IXFH) for deriving a three-dimensional image of atomic positions, 47) and X-ray absorption fine structure (XAFS) for elucidating atomic properties in solids. 48) The recent observation of a black hole shadow utilized the sparse-modeling technique because signals are received simultaneously at multiple observatories distributed worldwide, and hence the sampled data are inevitably incomplete. [6][7][8][9][10][11]49)

Advanced experimental data analysis

Phase retrieval
X-ray and neutron diffraction experiments measure the intensity I(k_i), which is related to the structure factor f(k_i) as I(k_i) ∝ |f(k_i)|². Hence, experiments provide only |f(k_i)|; that is, in the expression f(k_i) = |f(k_i)|e^{iφ(k_i)}, the information on the phase φ(k_i) is missing. Methods have been established to retrieve the phase factor, such as iterative Fourier transform methods based on the error-reduction algorithm 50) and the hybrid input-output algorithm. 51) To achieve robust and efficient phase retrieval, compressed sensing has been extended to the situation where only (|y_1|, |y_2|, ..., |y_M|) are known in the equation y = Ax. 52) A similar approach has been applied to terahertz imaging 53) and coherent X-ray diffraction imaging (CDI). 54,55)

Peak identification
There is another interesting application of L1 regularization to data analysis for STM experiments. The intensity plot in Fig. 12 shows an STM topography image measured for a SrVO3 thin film. High intensities indicate the existence of atoms. One can see the periodic alignment of atoms, some lattice defects, and a lattice dislocation. The identification of such features partly relies on experience; definite criteria are desired for an unbiased analysis of topography images.
Miyama and Hukushima proposed an algorithm to identify peak positions in the topography image. 56) In this problem, y is a one-dimensional representation of the image and x_k represents the weight of a peak at position r_k. Here, x consists of a large number of peaks (as many as the number of pixels). Using L1-norm regularization, they retained relevant peaks and succeeded in identifying peak positions (open black circles in Fig. 12) without any tuning parameters. Thus, their method offers unbiased peak identification for topography images. A related approach for one-dimensional peak deconvolution has been proposed based on Bayesian inference. [57][58][59]

Model selection
In many situations, it is useful to construct an effective model that describes data obtained from numerical simulations or experiments. For simulations, such a model can be used to perform further accurate calculations while avoiding expensive ones. For experiments, one may want to gain insight into a microscopic mechanism from the observed data. In both cases, sparse modeling (compressed sensing) is useful, as we show in this subsection. First-principles calculations based on density functional theory (DFT) are a powerful tool for investigating and predicting the structure of a solid. However, fully first-principles calculations may be too expensive because the electronic structure must be determined for each atomic configuration.
Nelson et al. applied compressed sensing to the cluster expansion (CE). 60) They illustrated the use of their method on two CE models of configurational energetics, for Ag-Pt alloys and protein folding, in Ref. 60. The CE method constructs an energy model for describing configurations of a crystal. The total energy can be expressed as E(σ) = Σ_f J_f ⟨Π_f(σ)⟩ [Eq. (52)], where f represents symmetrically distinct clusters of lattice sites (points, pairs, triplets, etc.). The symbol σ denotes the atomic configuration, which may be a collection of pseudo spins specifying the type of atom at each site, and the correlation functions ⟨Π_f(σ)⟩ are obtained as symmetrized averages of the products of these pseudo spins. What we want to determine is J_f, the effective cluster interactions (ECIs). Once the ECIs are determined from expensive first-principles data (for many atomic configurations), one can compute an approximate value for the energy of any atomic configuration.
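A toy version of this fitting problem (ours, far simpler than the alloy models of Ref. 60): on a one-dimensional Ising-like chain, the "clusters" are pairs at each separation d, the descriptors ⟨Π_f(σ)⟩ are averaged pair correlations, and the surrogate energies are generated from a sparse set of ECIs. The L1 fit recovers the two non-zero couplings and sets the rest to zero.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y, lam, n_iter=3000):
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x + A.T @ (y - A @ x) / L, lam / L)
    return x

rng = np.random.default_rng(0)
L_sites, n_conf = 16, 40
dists = list(range(1, L_sites // 2 + 1))

# descriptors <Pi_f(sigma)>: averaged pair correlations at each separation d,
# for random Ising configurations on a periodic chain
sigma = rng.choice([-1, 1], size=(n_conf, L_sites))
Pi = np.stack([np.mean(sigma * np.roll(sigma, -d, axis=1), axis=1) for d in dists], axis=1)

# true ECIs: only nearest- and next-nearest-neighbor pairs contribute
J_true = np.zeros(len(dists))
J_true[0], J_true[1] = -1.0, 0.3
E = Pi @ J_true + 1e-4 * rng.standard_normal(n_conf)   # surrogate "DFT" energies

J_fit = lasso_ista(Pi, E, lam=0.05)
print("fitted ECIs:", np.round(J_fit, 3))
```

The small L1 bias slightly shrinks the retained couplings; the point of the example is that the irrelevant clusters are eliminated automatically rather than fitted to noise.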
The ECIs may be considered an extension of the exchange couplings of Ising spins, but they involve vastly more complicated clusters than nearest-neighbor pairs. Using L1-norm regularization, Nelson et al. fitted first-principles data with a small number of J_f in Eq. (52) and succeeded in constructing a model with high predictive power. They further extended their method by using a Bayesian implementation of compressed sensing, which removes an adjustable parameter in the original implementation. 61) The importance of the selection of descriptors has been discussed in the context of describing the ground-state structure of binary compounds. 62,63) The potential energy surface (PES) plays a key role in molecular dynamics calculations. A PES describes the relationship between the energy and atomic configurations, and can be constructed from training data obtained using DFT calculations. The key factor for success is the choice of descriptors, which are basis functions for representing atomic configurations. Seko et al. showed that the energy can be expressed by a linear model with simple basis functions and a small number of coefficients. Figure 13 shows the accuracy of their model constructed for Mg with 95 basis functions; the prediction errors are typically smaller than 1 meV/atom.
Compressed sensing has also been used to construct effective models of phonons for strongly anharmonic crystals. Examples include the prediction of the lattice thermal conductivity of compounds such as Cu12Sb4S13 67) and of the anharmonic phonon frequencies and phonon lifetimes of cubic SrTiO3 (STO). 66) Figure 14 shows the anharmonic phonon dispersion of cubic STO computed using the effective model; good agreement with experimental data can be seen.

Construction of minimal models for describing theoretical or experimental data
Sparse modeling can be used to construct a minimal effective model for describing theoretical or experimental data. For instance, one can construct a spatially localized model that reproduces the eigenfunctions and eigenvalues of a given system, such as localized Wannier functions. 68) This approach was extended and used to search for localized Wannier functions of topological band structures. 69) Sparse modeling can also be used to bridge experiments and theories. Tamura and Hukushima proposed the use of the sparse-modeling technique for estimating relevant microscopic parameters of an effective spin model from magnetization curves, which can be measured in experiments. 70) Similarly, Mototake et al. proposed a method for analyzing core-level X-ray photoelectron spectroscopy (XPS) spectra based on the Bayesian model selection framework, in which relevant parameters are automatically selected. 71) Another interesting theoretical approach is the construction of effective models from numerical data for Hubbard-type models. 72) It was shown that the resultant effective models reproduce the results of conventional approaches, such as perturbation theories. This approach is general and may be applicable to the analysis of experimental data.

Further developments

Diagnosis of compressed sensing results
Compressed sensing yields a result even from an incomplete dataset. We should, however, keep in mind that the result may fail to capture the correct features of the exact solution if the dataset is excessively incomplete. The example of the random-matrix model in Sect. 2.4 revealed the existence of a boundary that separates success and failure. To obtain a successful result, the sampling number M (the dimension of y) should be sufficiently large that α ≡ M/N > α_c (Fig. 5). An important and practical question is whether it is possible to judge that the number of samples is sufficient for compressed sensing without knowing the correct solution, i.e., the result in the limit M → N.
Nakanishi and Hukushima proposed a method for diagnosing the results of compressed sensing as success or failure. 73,74) They applied K-fold CV (Sect. 3.4.2) and computed the CV error (CVE), the mean squared validation error per component of the validation set of size M_V = M/K, where x_T is the solution computed for a given training dataset y_T, and y_V and P_V are explained in Eq. (47). Figure 15 shows a log-log plot of the CVE versus K for several values of α. The CVE vanishes for α > α_c (success), converges to a finite value for α < α_c (failure), and exhibits a power-law decay at α = α_c. This result indicates that it is possible to judge the success or failure of compressed sensing using only a given dataset. If the diagnosis concludes success, then one can safely cease further measurement to reduce measurement time and cost.

Dictionary learning
The performance of compressed sensing strongly depends on the sparsity of the vector x to be reconstructed. Therefore, it is crucially important to choose a proper basis (representation) in which x is expected to be sparse. However, there are situations where a sparse representation is not known in advance. This fact motivates us to consider another problem: finding a sparse representation as well as a sparse solution for a given representation. The problem is formulated as follows. 75,76) Let us suppose that in the equation y = Ax, only the vector y is known, and that multiple datasets {y_i} are given. We want to find a representation A that makes x sparse for all {y_i}, namely, to minimize Σ_i ‖x_i‖_0 over A and {x_i} subject to y_i = Ax_i [Eq. (54)]. We stress that, unlike in the optimization problems considered so far [see, e.g., Eq. (18)], the matrix A is not given in this equation but is determined so that each x_i becomes as sparse as possible. A constraint on the norm of the columns of A removes the arbitrariness of scale from the solution for A. In this context, the matrix A is called a dictionary, and the problem defined in Eq. (54) is called dictionary learning or sparse coding, which is a kind of machine learning. The standard method for approximately solving this optimization problem is K-SVD (SVD stands for singular value decomposition). [76][77][78] In Eq. (54), we may replace the L0 norm with the L1 norm to avoid computational difficulty, as in the derivation of the LASSO in Eq. (15). For this LASSO-type dictionary learning, Mairal et al. proposed an efficient online learning algorithm that allows the dictionary to be updated iteratively as new data {y_i} become available incrementally. 79)
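K-SVD itself is somewhat lengthy, so the sketch below (entirely illustrative, with our own function names and problem sizes) uses the simpler MOD-style update, D = Y X⁺ followed by column normalization, alternated with a greedy sparse-coding step. It captures the same alternating structure: code the data sparsely for a fixed dictionary, then refit the dictionary for the fixed codes.

```python
import numpy as np

def sparse_code(D, Y, n_nonzero):
    """Greedy per-column sparse coding (a crude stand-in for OMP)."""
    X = np.zeros((D.shape[1], Y.shape[1]))
    for j in range(Y.shape[1]):
        r = Y[:, j].copy()
        support = []
        for _ in range(n_nonzero):
            support.append(int(np.argmax(np.abs(D.T @ r))))
            Ds = D[:, support]
            coef, *_ = np.linalg.lstsq(Ds, Y[:, j], rcond=None)
            r = Y[:, j] - Ds @ coef
        X[support, j] = coef
    return X

def learn_dictionary(Y, n_atoms, n_nonzero=3, n_iter=30, seed=0):
    """Alternate sparse coding and a MOD-style dictionary update."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)                     # normalization constraint
    for _ in range(n_iter):
        X = sparse_code(D, Y, n_nonzero)               # sparse-coding step
        D = Y @ np.linalg.pinv(X)                      # dictionary update (MOD)
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D, sparse_code(D, Y, n_nonzero)

# synthetic data generated from a hidden dictionary with 3-sparse codes
rng = np.random.default_rng(1)
m, n_atoms, n_data = 20, 30, 200
D_true = rng.standard_normal((m, n_atoms))
D_true /= np.linalg.norm(D_true, axis=0)
X_true = np.zeros((n_atoms, n_data))
for j in range(n_data):
    X_true[rng.choice(n_atoms, 3, replace=False), j] = rng.standard_normal(3)
Y = D_true @ X_true

D, X = learn_dictionary(Y, n_atoms)
rel_err = np.linalg.norm(Y - D @ X) / np.linalg.norm(Y)
print("relative fit error after learning: %.3f" % rel_err)
```

A random dictionary with 3-sparse codes leaves a large residual; after learning, the fit error drops substantially because the recovered atoms align with the hidden ones.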

Low-rank matrix completion
So far, we have considered the target quantity x to be a vector and focused on its sparsity. There is a related problem that takes advantage of the sparsity of a matrix instead of a vector. Let i and j represent indices of customers and products, respectively, and let X_ij be the correlation between i and j (X_ij is large if customer i has checked or bought product j). Given partial data of the matrix X, we want to complete X to predict the extent of interest that customer i′ has in product j′. This problem is called matrix completion.
For this problem, the sparsity of X has been shown to play a relevant role. Candès and Recht demonstrated that, if X is low-rank, then X can be reconstructed perfectly from incomplete data of X (denoted by A) by solving the optimization problem 80) min_X rank(X) subject to X_ij = A_ij for (i, j) ∈ E [Eq. (55)], where E is the set of sampled components of A and rank(X) is the number of non-zero singular values s_l of the matrix X. The rank of a matrix corresponds to the L0 norm of a vector; therefore, solving the optimization problem (55) is NP-hard. In analogy with the LASSO, it is natural to replace rank(X) with the matrix counterpart of the L1-norm, namely the nuclear norm ‖X‖_* = Σ_l s_l, where s_l are the singular values of X. The alternative optimization problem obtained by this replacement belongs to the class of convex relaxation. A practical method for solving the convex relaxation problem is the singular value thresholding algorithm. 81) The influence of noise on matrix completion was discussed in Ref. 82.
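The nuclear-norm relaxation can be attacked in a few lines of numpy: iteratively soft-threshold the singular values (which decreases the nuclear norm) and re-impose the observed entries. The sketch below is a simplified variant in the spirit of singular value thresholding, not a faithful reimplementation of Ref. 81; the threshold τ and the problem sizes are illustrative choices of ours.

```python
import numpy as np

def svt_complete(A_obs, mask, tau=2.0, n_iter=300):
    """Shrink singular values (reducing the nuclear norm), re-impose data."""
    X = np.where(mask, A_obs, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt   # soft-threshold the s_l
        X[mask] = A_obs[mask]                     # keep observed entries fixed
    return X

rng = np.random.default_rng(0)
n, r = 30, 2
M_true = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-2
mask = rng.random((n, n)) < 0.5                   # observe half of the entries
X = svt_complete(np.where(mask, M_true, 0.0), mask)

err = np.linalg.norm((X - M_true)[~mask]) / np.linalg.norm(M_true[~mask])
print("relative error on unobserved entries: %.3f" % err)
```

Because the hidden matrix is rank-2, half of the entries suffice to fill in the other half with small error; the same procedure on a full-rank matrix would fail.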

Sparsity of Many-Body Green's Functions
This and the following sections discuss quantum many-body physics. In connection with the previous sections, we begin with the equation y = Ax and see where it appears in quantum many-body problems. From this consideration, we will find sparsity hidden in many-body Green's functions.

y = Ax in quantum many-body problems
The quantity we want to know, x, is the spectral function ρ(ω), such as the single-particle excitation spectrum measured in angle-resolved photoemission spectroscopy experiments or the magnetic excitation spectrum measured in inelastic neutron scattering experiments. One of the main tasks of theoretical investigations is to compute ρ(ω) starting from a microscopic Hamiltonian. However, because it is difficult to treat ρ(ω) directly, we introduce the imaginary time τ = it and consider the correlation function G(τ) in the imaginary-time domain. 83) This is called the imaginary-time or Matsubara Green's function. We assign G(τ) to y. Imaginary-time descriptions enable sophisticated treatments of interactions, such as perturbative expansions using Feynman diagrams and quantum Monte Carlo (QMC) simulations.
After G(τ) is evaluated with the aid of some analytical or numerical method, the imaginary time τ should be transformed back to real time to derive ρ(ω). This procedure is the analytical continuation. In practical calculations, one may use the exact relation between G(τ) and ρ(ω), expressed as G = Kρ [Eq. (58)], where G and ρ denote vector representations of G(τ) and ρ(ω), respectively, and K is a matrix defined below. An equation of the form y = Ax thus appears. The problem of analytical continuation can be interpreted as the inverse problem of evaluating ρ for a given G.

Exact relations
In this subsection, we complement Eq. (58) with a rigorous derivation and some remarks. G(τ) and ρ(ω) are related to each other through the Lehmann representation, or spectral representation, of the form 83) G_α(τ) = ∫ dω K_α(τ, ω) ρ_α(ω) [Eq. (59)], where α specifies either fermionic statistics (α = F) or bosonic statistics (α = B), and the variable τ ranges from 0 to β. The kernel function K_α(τ, ω) is given by K_F(τ, ω) = e^{−τω}/(1 + e^{−βω}) and K_B(τ, ω) = ωe^{−τω}/(1 − e^{−βω}). The spectral function ρ(ω) is related to the retarded Green's function G^R(ω) by ρ(ω) = Im G^R(ω)/π. In the bosonic case, the kernel is defined with the extra factor ω to cancel the divergence of the Bose distribution function, 1/(1 − e^{−βω}) ∼ 1/(βω), around ω = 0. 19) Hereafter, we will omit the index α for simplicity when a particular distinction is unnecessary. Now, we transform the integral equation (59) into a matrix-vector representation. To this end, we first introduce a cutoff ω_max for the infinite integral over ω. The variables ω and τ are then discretized into N and M slices, respectively, with equal intervals. Defining G_i ≡ G(τ_i) and ρ_j ≡ ρ(ω_j)Δω, we obtain Eq. (58), G = Kρ. Here, K is an (M × N) matrix defined by K_ij = K(τ_i, ω_j). We note that the discretization error can be reduced by increasing N and M toward the continuous limit, N, M → ∞. In contrast, we cannot take the limit ω_max → ∞, meaning that the cutoff ω_max is essential. We will see that the dimensionless parameter Λ ≡ βω_max controls the basis functions presented below. A detailed description is given in Sect. 7.
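As a small numerical check of this discretization (with illustrative values of β, ω_max, M, and N chosen by us, and the sign convention G(τ) = ∫ dω K(τ, ω)ρ(ω) as above): building K_ij for the fermionic kernel and applying it to a normalized model spectrum reproduces the sum rule G(0) + G(β) = ∫ dω ρ(ω) = 1, because K(0, ω) + K(β, ω) = 1 for every ω.

```python
import numpy as np

# fermionic kernel K(tau, omega) = exp(-tau*omega) / (1 + exp(-beta*omega)),
# discretized on M tau-slices and N omega-slices (illustrative values)
beta, omega_max = 100.0, 1.0
M, N = 200, 400
tau = np.linspace(0.0, beta, M)
omega = np.linspace(-omega_max, omega_max, N)
d_omega = omega[1] - omega[0]
K = np.exp(-tau[:, None] * omega[None, :]) / (1.0 + np.exp(-beta * omega))[None, :]

# model spectrum: two peaks, normalized so that the sum rule equals 1
rho = np.exp(-(omega - 0.5) ** 2 / 0.01) + np.exp(-(omega + 0.5) ** 2 / 0.01)
rho /= np.sum(rho) * d_omega

G = K @ (rho * d_omega)        # Eq. (58): G_i = sum_j K_ij rho_j with rho_j = rho(omega_j) d_omega
print("G(0) + G(beta) =", G[0] + G[-1])
```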

"Intermediate representation" (IR) of Green's functions
Let us consider the inverse problem in Eq. (58) from the sparse-modeling point of view. The central question in this respect is which representation (basis set) makes ρ sparse, because sparsity relies on the representation, as emphasized in Sect. 2.7. We address this question by looking into the nature of the kernel K.
Using singular value decomposition (SVD), the real non-square matrix K can be decomposed as K = USV^T [Eq. (62)], where U and V are orthogonal matrices, the superscript T indicates the transpose, and S is a diagonal matrix containing the non-negative singular values s_l (l = 0, 1, 2, ...) in descending order. An important observation is that s_l decreases exponentially, or even faster, as shown in Fig. 16. The meaning of the singular values can be explained as follows. We first transform the vectors G and ρ using the orthogonal matrices U and V as G′ = U^T G and ρ′ = V^T ρ [Eq. (63)]. Substituting Eq. (62) into Eq. (58), we obtain G′ = Sρ′ [Eq. (64)]. Because S is diagonal, this equation reduces to the element-wise expression G′_l = s_l ρ′_l [Eq. (65)]. This equation explains the role of the singular values. The transformation from ρ to G consists of three procedures: basis transformation using the matrix V, weighting by s_l, and transformation back to the original basis using the matrix U. The exponential decay of s_l indicates that G contains all pieces of the information in ρ, but with extremely different weights. This situation is shown schematically in Fig. 17(a). Mathematically, the difficulty of treating a matrix on computers is quantified by the condition number C ≡ s_max/s_min, where s_max and s_min denote the maximum and minimum singular values, respectively. Unitary matrices yield the smallest value, C = 1, which corresponds to the most well-conditioned case. On the other hand, when C ≫ 1, the matrix is said to be ill-conditioned. As shown in Fig. 16, s_l of the kernel K(τ, ω) decays exponentially, and hence the condition number C exceeds the machine precision, namely C = ∞ in practice. Thus, K is in the class of extremely ill-conditioned matrices, and Eq. (58) is an ill-conditioned inverse problem.
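The decay of s_l is easy to observe numerically. The sketch below (self-contained, with the same illustrative kernel parameters as before) computes the SVD of the discretized fermionic kernel; already s_20 is orders of magnitude below s_0, and the smallest singular values fall below the relative machine precision, so the computed condition number is meaningless, i.e., effectively infinite.

```python
import numpy as np

# discretized fermionic kernel (illustrative parameters)
beta, omega_max = 100.0, 1.0
M, N = 200, 400
tau = np.linspace(0.0, beta, M)
omega = np.linspace(-omega_max, omega_max, N)
K = np.exp(-tau[:, None] * omega[None, :]) / (1.0 + np.exp(-beta * omega))[None, :]

U, s, Vt = np.linalg.svd(K, full_matrices=False)   # K = U S V^T [Eq. (62)]
print("s_0         =", s[0])
print("s_20 / s_0  =", s[20] / s[0])               # rapid (exponential) decay
print("s_min / s_0 =", s[-1] / s[0])               # at or below machine precision
```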
Recall that the kernel K depends only on β and the cutoff ω_max. This means that the two orthogonal matrices, U and V, constitute model-independent basis sets. Because these bases connect the real- and imaginary-frequency representations [Eq. (65)], they are called intermediate representations (IRs). 85) Figure 18 shows a schematic diagram of the IR. We will show below a striking conclusion that results from the ill-conditioned nature of K.

Example
We consider the model spectrum ρ(ω) shown in Fig. 19(a). It is transformed into G(τ), plotted in Fig. 19(b), by performing the numerical integration in Eq. (59). The inverse transformation from G(τ) to ρ(ω) is the main subject (analytical continuation) of the next section. Here, we focus on the basis transformation of each quantity.

Figures 19(c) and 19(d) (blue crosses) show the expansion coefficients ρ′_l and G′_l, respectively. Here, only the even-number components of ρ′_l and G′_l are shown because the odd-number components vanish due to the symmetry condition ρ(ω) = ρ(−ω). Both quantities exhibit exponential decay as l increases. In particular, we stress that G′_l decays much faster than ρ′_l as a result of the exponential behavior of s_l [see Eq. (65) and Fig. 16]. It should be emphasized that the fast decay of G′_l does not depend on the particular model but is an intrinsic feature of G′_l, because it relies purely on the nature of the kernel K(τ, ω).

Consequences of the exponential decay of G′_l
What consequences are implied by the exponential decay of G′_l? To answer this, we consider a situation where the input G(τ) has errors, such as the statistical errors in QMC simulations. We simulate this situation by adding Gaussian noise of width σ = 10⁻³ to the exact data in Fig. 19(b). The data with noise and the exact data are transformed into G′_l in Fig. 19(d). The influence of errors is apparent in this representation. Because of the exponential decay of the exact data, the impact of the noise grows with l and finally dominates the exact value for l ≳ l_0 = 14; the high-order components contain only noise.
This implies two contrasting consequences, depending on what G(τ) is used for. Let us first suppose that one wants to know ρ(ω). Then, the exponential decay of G′_l indicates that the numerically computed G(τ) does not contain sufficient information on ρ(ω). In other words, the problem of analytical continuation is essentially an underdetermined problem, in which the information required for solving the equation is lacking. Furthermore, in the next section we will show that the noise in the high-order components makes the inverse transformation unstable. Sparsity is thus a key premise in the problem of analytical continuation.
Next, we suppose that one is interested in G(τ) itself rather than ρ(ω). In this case, the exponential decay of G′_l has a positive consequence: it allows one to represent G(τ) using only a few basis functions without losing meaningful information. For the data in Fig. 19, the 4000 points of G(τ) can be represented by only 7-8 components, exactly within numerical accuracy. Using this representation, we can perform efficient computations in many-body theories such as QMC simulations. The details are discussed in Sect. 7.

SpM Analytical Continuation
In this section, we discuss how analytical continuation is performed using sparse modeling. The purpose here is to compute ρ(ω) for a given G(τ), which in general contains noise. This procedure is formulated as the inversion of the linear equation in Eq. (58). It involves considerable difficulty, as discussed in Sect. 5: the matrix K is ill-conditioned, and therefore the equation is inevitably an underdetermined system in which the input G carries little meaningful information. Furthermore, the inverse of the ill-conditioned matrix amplifies noise exponentially, as shown below.

Historical review
Let us first review the problem of analytical continuation with a brief overview of related approaches. The Matsubara Green's function G(iω_n) is defined as the Fourier transform of the imaginary-time Green's function G(τ) introduced in Sect. 5, G(iω_n) = ∫₀^β dτ G(τ)e^{iω_n τ}, where ω_n is the Matsubara frequency defined by ω_n = (2n + 1)π/β for fermionic systems and ω_n = 2nπ/β for bosonic systems, with n an integer. When an analytical expression for G(iω_n) is known, the spectrum ρ(ω) is readily obtained by replacing iω_n with ω + i0, namely ρ(ω) = Im G(ω + i0)/π. 83) If only numerical values of G(iω_n) are given, then the analytical continuation i ω_n → z needs to be performed numerically. In other words, we infer an analytical expression G̃(z) in the whole complex frequency plane from the values G(iω_n) given at the discrete imaginary frequency points. The Padé approximation 87) uses the rational function G̃(z) = (a_0 + a_1 z + ⋯)/(b_0 + b_1 z + ⋯) to fit G(iω_n), and then extrapolates it to arbitrary complex frequency z. This method has been widely applied in combination with, for example, diagrammatic perturbation theories. However, the Padé approximation is known to be quite sensitive to non-systematic errors in G(iω_n), as shown in Fig. 20. In particular, there is no guarantee that ρ(ω) satisfies fundamental properties such as non-negativity and the sum rule. These disadvantages preclude the use of the Padé approximation for the analytical continuation of QMC data.

Explosion of errors
Equations (64) and (65) show that ρ(ω) and G(τ) are directly connected to each other in the IR basis. Therefore, one could naively evaluate ρ(ω) by converting the input G(τ) into G′_l and using the relation ρ′_l = G′_l/s_l [Eq. (66)]. This, however, does not work: even tiny noise in the input G(τ) seriously affects the results for ρ(ω).
To understand the influence of noise on analytical continuation, let us consider the data in Fig. 19(d) explicitly. Because the exact value of G′_l decays exponentially (× symbols), the large-l components basically contain only noise, namely |G′_l| ∼ σ = 10⁻³ (+ symbols) for l ≳ l_0 = 14. Equation (66) then yields |ρ′_l| ∼ σ/s_l. This result shows that the noise is amplified exponentially at large l. It also indicates that the influence of noise can be removed by truncating the high-order components in the IR basis.
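This amplification can be reproduced in a few lines (a self-contained toy of ours with the same illustrative kernel and model spectrum as before, not the data of Fig. 19): the naive inversion ρ′_l = G′_l/s_l explodes by many orders of magnitude, while dropping the components whose s_l lies below the noise scale keeps the coefficients bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, omega_max, M, N = 100.0, 1.0, 200, 400
tau = np.linspace(0.0, beta, M)
omega = np.linspace(-omega_max, omega_max, N)
d_omega = omega[1] - omega[0]
K = np.exp(-tau[:, None] * omega[None, :]) / (1.0 + np.exp(-beta * omega))[None, :]
U, s, Vt = np.linalg.svd(K, full_matrices=False)

rho = np.exp(-(omega - 0.5) ** 2 / 0.01) + np.exp(-(omega + 0.5) ** 2 / 0.01)
rho /= np.sum(rho) * d_omega
G = K @ (rho * d_omega)
G_noisy = G + 1e-3 * rng.standard_normal(M)     # Gaussian noise, sigma = 1e-3

Gp = U.T @ G_noisy                              # coefficients G'_l
rho_naive = Gp / s                              # Eq. (66): rho'_l = G'_l / s_l
keep = s / s[0] > 1e-3                          # keep only s_l above the noise scale
rho_trunc = np.where(keep, Gp / s, 0.0)

print("kept components       :", int(np.sum(keep)))
print("max |rho'_l| naive    : %.1e" % np.max(np.abs(rho_naive)))
print("max |rho'_l| truncated: %.1e" % np.max(np.abs(rho_trunc)))
```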
In the treatment of large-scale data, it is common to truncate bases according to the singular values, a process called dimensionality reduction. Here, it is important to emphasize the difference between ordinary dimensionality reduction and the problem of analytical continuation. Normally, as the number of retained bases increases, the accuracy of the reduced representation improves. A typical example in quantum many-body physics is the density matrix renormalization group method, in which the density matrix is approximated with this technique. In the present case of analytical continuation, using more bases results in a worse spectrum because the noise is amplified by the small s_l. Therefore, we need to select appropriate bases rather than simply truncate them.

Selection of bases
We now select bases in the IR representation to remove the influence of noise. In other words, we need to find a sparse solution in this representation, and thus the sparse modeling of Sect. 2 applies to the present problem.
We first define the squared error χ² for the target equation [Eq. (58)] as χ²(ρ|G) = (1/2)‖G − Kρ‖₂² [Eq. (67)]. Here, the argument on the left-hand side of the bar (ρ) indicates the quantity to be varied, and that on the right-hand side (G) indicates a fixed quantity. Using the IR, we can represent χ² as χ²(ρ′|G′) = (1/2)‖G′ − Sρ′‖₂² [Eq. (68)]. The expressions in Eqs. (67) and (68) are mathematically equivalent, but may give different values in numerical calculations because of the ill-conditioned nature of K.
To enforce sparseness on the solution in the IR domain, we introduce the L1 regularization term and consider a LASSO-type minimization problem of the form 86,88) F(ρ′|G′; λ) = χ²(ρ′|G′) + λ‖ρ′‖₁ [Eq. (69)]. The second term is the L1 regularization, which makes the solution sparse by removing irrelevant bases. The parameter λ controls the extent to which sparseness is enforced; an optimal value can be determined automatically, as discussed later. In practical situations, the input G(τ_i) computed in QMC calculations may be accompanied by statistical errors (error bars) ΔG_i. More generally, the statistical errors are expressed by the covariance matrix C, whose diagonal part corresponds to C_ii = (ΔG_i)². In this case, we can extend the squared error in Eq. (67) to a form that takes C into account: 19) χ²(ρ|G, C) = (1/2)(G − Kρ)^T C⁻¹(G − Kρ) [Eq. (70)]. The role of C can be understood by noting that the diagonal components (G − Kρ)_i² are weighted by 1/(ΔG_i)² in the summation, so that more accurate components have a stronger influence on χ². It has been shown for the MEM and the stochastic method that the inclusion of ΔG_i or C improves the results of analytical continuation. 19,93,109) Using the IR basis, Eq. (70) is rewritten as χ²(ρ′|G′, W) = (1/2)(G′ − Sρ′)^T W(G′ − Sρ′) [Eq. (71)], where the matrix W is defined by W = U^T C⁻¹ U. The function to be minimized is F(ρ′|G′, W; λ) = χ²(ρ′|G′, W) + λ‖ρ′‖₁ [Eq. (72)]. This expression reduces to Eq. (69) when C is replaced by the unit matrix, i.e., F(ρ′|G′, 1; λ) = F(ρ′|G′; λ).
Our task now is to find the ρ′ that minimizes F(ρ′) in Eq. (69) or Eq. (72) subject to two constraints, namely non-negativity and the sum rule: ρ(ω) ≥ 0 and ∫ dω ρ(ω) = c, where c = 1 for the ordinary (spin- and orbital-) diagonal fermionic Green's function and c = 0 for off-diagonal components. In general, the value of c can be determined from the Green's function itself: the fermionic value corresponds to the coefficient of the high-frequency asymptotic form G_F(iω_n) ∼ −c/iω_n, and the bosonic one corresponds to the static susceptibility, c = G_B(iω_n = 0). Even with these additional constraints, the minimization problem in Eq. (72) can be solved using the ADMM algorithm presented in Sect. 3.3; for details of the implementation, we refer readers to the original papers. 86,88) When λ is too small [Fig. 22(c)], the spectrum is constructed using many bases, including those that contain no relevant information; such redundant bases result in spiky and oscillatory spectra, which lack reproducibility. When λ is properly optimized, SpM selects relevant components that are not influenced by noise. The optimal value of λ was determined using the elbow method (Sect. 3.4.1). Figure 23(a) shows the λ-dependence of the squared error χ²(ρ′(λ)|G′). The behavior is similar to that of the random-matrix model in Fig. 9(a): χ² is insensitive to the variation of λ in the small-λ region (overfitting), and increases steeply in the large-λ region (underfitting). The reasonable spectrum in Fig. 21(b) was obtained around the elbow at λ_opt = 10^{−1.8}. Figure 23(b) shows that the number of bases retained in the converged solution ρ′ (the number of red circles in Fig. 22) exhibits a plateau around λ = λ_opt, indicating the stability of the solution against an order-of-magnitude change in λ.
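The role of the L1 term is easiest to see in a stripped-down setting: if the non-negativity and sum-rule constraints are dropped and C is the unit matrix, the diagonal S makes Eq. (69) decouple into independent scalar LASSO problems, each with the closed-form solution ρ′_l = S_{λ/s_l²}(G′_l/s_l), where S_t is the soft-thresholding function. The sketch below (our own illustrative model and parameters, not the full SpM-ADMM algorithm) shows how this single step already discards the noise-dominated components automatically.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(0)
beta, omega_max, M, N = 100.0, 1.0, 200, 400
tau = np.linspace(0.0, beta, M)
omega = np.linspace(-omega_max, omega_max, N)
d_omega = omega[1] - omega[0]
K = np.exp(-tau[:, None] * omega[None, :]) / (1.0 + np.exp(-beta * omega))[None, :]
U, s, Vt = np.linalg.svd(K, full_matrices=False)

rho = np.exp(-(omega - 0.5) ** 2 / 0.01) + np.exp(-(omega + 0.5) ** 2 / 0.01)
rho /= np.sum(rho) * d_omega
G = K @ (rho * d_omega) + 1e-3 * rng.standard_normal(M)   # noisy input G(tau)

Gp = U.T @ G
lam = 1e-2
# element-wise minimizer of (1/2)(G'_l - s_l rho'_l)^2 + lam |rho'_l|
rho_p = soft_threshold(Gp / s, lam / s ** 2)
n_kept = int(np.sum(rho_p != 0.0))
print("IR components retained by the L1 term:", n_kept, "of", len(s))
```

The threshold λ/s_l² grows explosively for small s_l, so exactly the components where noise would be amplified are set to zero, while the leading components survive.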

Robustness against noise
The results in Sect. 6.4 demonstrate that the SpM method automatically removes irrelevant components that are strongly affected by noise [Fig. 22(b)]. Therefore, ρ(ω) evaluated using SpM is expected to be robust against noise. This property was verified in Ref. 88, where SpM analytical continuation was performed for 30 datasets of G(τ) (different noise configurations) and the mean value and standard deviation were estimated at each ω. Figure 24 shows the results of the SpM method. Clearly, SpM yields quite robust spectra even for σ = 10⁻³, whereas the Padé results (Fig. 20) show severe noise dependence.

Extent to which a spectrum is reconstructable
We have shown that the sparse modeling technique allows stable analytical continuation even in the presence of noise: almost the same results are obtained as long as the noise level, or the magnitude of statistical errors in QMC calculations, is of a given order. Note, however, that this does not mean that the obtained spectrum is correct. SpM fully uses the relevant information contained in the input, but the information lost due to noise is not regenerated. Let us discuss in more detail the extent to which the true spectrum is reconstructable. Figure 25 shows spectra obtained using SpM analytical continuation from several input datasets with different noise levels σ. The variation of the reconstructed $\rho(\omega)$ (red curves) signifies how much relevant information survives the noise. The result for $\sigma = 10^{-2}$ shows a large deviation because only five data points of $G'_l$ are retained above the noise level. For smaller noise levels, namely $\sigma = 10^{-3}$, $10^{-4}$, and $10^{-6}$, 6, 8, and 11 relevant points of $G_l$ are retained, respectively, and thus a better $\rho(\omega)$ can be reconstructed. The result for $\sigma = 10^{-6}$ shows perfect agreement with the exact $\rho(\omega)$. These results demonstrate the limitation of analytical continuation in the presence of noise.
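The counting of retained coefficients can be illustrated numerically. The sketch below is built on assumptions (a two-peak model spectrum and a uniform-mesh SVD standing in for the IR construction): it computes $G_l = -S_l \rho_l$ and counts how many coefficients survive above a given noise level σ.

```python
import numpy as np

# Count how many IR-like coefficients G_l stay above a noise level sigma.
# Illustrative assumptions: fermionic kernel at Lambda = beta*wmax = 100
# and an asymmetric two-peak model spectrum.
beta, wmax = 10.0, 10.0
n = 801
tau = np.linspace(0.0, beta, n)
omega = np.linspace(-wmax, wmax, n)
dt, dw = beta / (n - 1), 2 * wmax / (n - 1)

# Numerically stable fermionic kernel K(tau, w) = exp(-tau*w)/(1 + exp(-beta*w)).
K = np.exp(-tau[:, None] * omega[None, :] - np.logaddexp(0.0, -beta * omega[None, :]))
u, s, vt = np.linalg.svd(np.sqrt(dt) * K * np.sqrt(dw), full_matrices=False)
V = vt.T / np.sqrt(dw)                       # V[:, l] ~ V_l(omega) on the mesh

rho = np.exp(-(omega - 2.0) ** 2) + 0.5 * np.exp(-(omega + 1.0) ** 2)
rho /= rho.sum() * dw                        # normalize the spectrum to unity
rho_l = (V.T @ rho) * dw                     # rho_l = int dw V_l(w) rho(w)
G_l = -s * rho_l                             # G_l = -S_l rho_l

counts = [int(np.sum(np.abs(G_l) > sigma)) for sigma in (1e-2, 1e-3, 1e-6)]
print(counts)   # fewer coefficients survive at larger noise levels
```

The counts grow as σ decreases, mirroring the trend quoted in the text: the smaller the noise, the more coefficients remain usable for reconstructing $\rho(\omega)$.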

Application of Intermediate Representation of Green's Functions to Many-Body Calculations
The IR is a compact representation of the imaginary-time dependence of Green's functions. 85) As shown in previous sections, the IR plays a substantial role in SpM analytical continuation. The compactness of the IR may make it useful for reducing the computation time and memory consumption of quantum many-body simulations. In Ref. 85, the present authors proposed the use of the IR basis for efficient and compact many-body calculations, such as the measurement of Green's function in QMC calculations. In Ref. 84, some of the present authors proposed a numerical algorithm for computing the IR basis functions precisely, and investigated the properties of the IR in greater depth. Numerical data of the IR basis functions are available online. 110) In this section, we describe the properties of the IR basis functions in more detail and show some applications to many-body calculations.

Mathematical properties of IR basis functions
We now formulate the IR basis functions in the continuous limit. 84,85) The spectral (Lehmann) representation of the single-particle Green's function is
$$G^\alpha(\tau) = -\int_{-\omega_{\max}}^{\omega_{\max}} d\omega\, K^\alpha(\tau, \omega)\, \rho^\alpha(\omega), \quad (76)$$
where we introduce the cutoff frequency $\omega_{\max}$. Here, we assume that the spectrum is bounded in the interval $[-\omega_{\max}, \omega_{\max}]$. The superscript α specifies the statistics: $\alpha = \mathrm{F}$ for fermions and $\alpha = \mathrm{B}$ for bosons. The two complete orthonormal basis sets of the IR, $\{U_l^\alpha(\tau)\}$ and $\{V_l^\alpha(\omega)\}$, are defined through the decomposition
$$K^\alpha(\tau, \omega) = \sum_{l=0}^{\infty} S_l^\alpha\, U_l^\alpha(\tau)\, V_l^\alpha(\omega) \quad (77)$$
for $\tau \in [0, \beta]$ and $\omega \in [-\omega_{\max}, \omega_{\max}]$. These basis sets are orthonormalized as
$$\int_0^\beta d\tau\, U_l^\alpha(\tau) U_{l'}^\alpha(\tau) = \int_{-\omega_{\max}}^{\omega_{\max}} d\omega\, V_l^\alpha(\omega) V_{l'}^\alpha(\omega) = \delta_{ll'}.$$
This decomposition corresponds to the continuous limit of the SVD of the kernel discretized on a uniform mesh of τ and ω (refer to Sect. 5.2). In practice, the basis functions can be computed by solving the integral equation
$$S_l^\alpha U_l^\alpha(\tau) = \int_{-\omega_{\max}}^{\omega_{\max}} d\omega\, K^\alpha(\tau, \omega)\, V_l^\alpha(\omega) \quad (78)$$
under the orthonormality conditions. The basis $\{U_l^\alpha(\tau)\}$ also satisfies a Fredholm integral equation of the second kind,
$$(S_l^\alpha)^2\, U_l^\alpha(\tau) = \int_0^\beta d\tau'\, K_2^\alpha(\tau, \tau')\, U_l^\alpha(\tau'), \qquad K_2^\alpha(\tau, \tau') \equiv \int_{-\omega_{\max}}^{\omega_{\max}} d\omega\, K^\alpha(\tau, \omega) K^\alpha(\tau', \omega).$$
The integral equation (78) can be recast into a dimensionless form by the change of variables $x = 2\tau/\beta - 1$ and $y = \omega/\omega_{\max}$. 85) This explicitly shows that the singular values depend only on the statistics and a dimensionless parameter $\Lambda \equiv \beta\omega_{\max}$, up to an overall constant.
Recently, some of the present authors and co-workers developed an efficient numerical algorithm for solving the integral equation [Eq. (78)]. 84) Although the analytic form of the solution is unknown, numerical studies have revealed some interesting properties. When the singular values are ordered in descending order, for even (odd) values of $l$, $U_l(\tau)$ and $V_l(\omega)$ are even (odd) functions with respect to the center of the domain, i.e., $\tau = \beta/2$ or $\omega = 0$. More interestingly, $U_l(\tau)$ and $V_l(\omega)$ reduce to the Legendre polynomials in the limit $\Lambda \to 0$ (if the ranges of τ and ω are scaled properly). 85) Similarly to classical orthogonal polynomials such as the Legendre and Chebyshev polynomials, all the available numerical data indicate that $U_l(\tau)$ and $V_l(\omega)$ have $l$ zeros in their domains. Figure 26 shows the IR basis functions computed for $\Lambda = 100$ ($\beta = 10$ and $\omega_{\max} = 10$). One can clearly see the interesting properties discussed above. We refer readers to Ref. 84 for more details on the properties of the IR basis.
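The $\Lambda \to 0$ Legendre limit can be checked directly from the SVD of the kernel discretized on a uniform mesh (cf. Sect. 5.2). In this sketch, $\Lambda = 0.1$ is assumed as a stand-in for the limit.

```python
import numpy as np
from numpy.polynomial.legendre import legval

# Check that U_l approaches the Legendre polynomial P_l(x(tau)) for small Lambda.
beta, wmax = 1.0, 0.1                 # Lambda = beta * wmax = 0.1
n = 801
tau = np.linspace(0.0, beta, n)
omega = np.linspace(-wmax, wmax, n)
dt, dw = beta / (n - 1), 2 * wmax / (n - 1)
K = np.exp(-tau[:, None] * omega[None, :] - np.logaddexp(0.0, -beta * omega[None, :]))
u, s, vt = np.linalg.svd(np.sqrt(dt) * K * np.sqrt(dw), full_matrices=False)

x = 2.0 * tau / beta - 1.0            # scale [0, beta] onto [-1, 1]
overlaps = []
for l in range(4):
    p = legval(x, [0.0] * l + [1.0])  # Legendre polynomial P_l on the mesh
    cos = abs(u[:, l] @ p) / (np.linalg.norm(u[:, l]) * np.linalg.norm(p))
    overlaps.append(float(cos))       # |cosine similarity|, sign-insensitive
print(overlaps)                       # all close to 1 for small Lambda
```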
The integral equation [Eq. (78)] is ill-conditioned as the singular values decay exponentially. Thus, solving it numerically requires arbitrary-precision arithmetic, which is computationally expensive. A library is provided with precomputed numerical data of the basis functions. 110) It allows us to use the IR basis as easily as classical orthogonal polynomials.

Convergence properties of IR
We expand $G^\alpha(\tau)$ using the complete basis $\{U_l^\alpha(\tau)\}$ as follows:
$$G^\alpha(\tau) = \sum_{l=0}^{\infty} G_l^\alpha\, U_l^\alpha(\tau).$$
If the spectrum of $G^\alpha(\tau)$ is bounded in $[-\omega_{\max}, \omega_{\max}]$ (see Fig. 27), substituting Eq. (77) into Eq. (76) and comparing with the above equation, we obtain
$$G_l^\alpha = -S_l^\alpha \rho_l^\alpha, \quad (82)$$
where $\rho_l^\alpha$ is given by $\rho_l^\alpha = \int_{-\omega_{\max}}^{\omega_{\max}} d\omega\, V_l^\alpha(\omega)\, \rho^\alpha(\omega)$. Equation (82) shows that the expansion coefficients $G_l$ decay at least as fast as $S_l$. Because the singular values decay exponentially with $l$, one may not need basis functions that correspond to small singular values with $S_l/S_0 < \delta$ to express Green's function in practical calculations (e.g., $\delta = 10^{-5}$ may be sufficient for typical noisy Monte Carlo data).
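The relation $G_l = -S_l \rho_l$ and the resulting decay bound can be verified numerically using the SVD of the kernel discretized on a uniform mesh (cf. Sect. 5.2). The model spectrum below is an assumption of the sketch.

```python
import numpy as np

# Numerical check of G_l = -S_l * rho_l and of the bound |G_l| <= S_l * ||rho||_2.
# Assumptions: Lambda = 100 and an asymmetric two-peak Gaussian spectrum.
beta, wmax = 10.0, 10.0
n = 801
tau = np.linspace(0.0, beta, n)
omega = np.linspace(-wmax, wmax, n)
dt, dw = beta / (n - 1), 2 * wmax / (n - 1)
K = np.exp(-tau[:, None] * omega[None, :] - np.logaddexp(0.0, -beta * omega[None, :]))
u, s, vt = np.linalg.svd(np.sqrt(dt) * K * np.sqrt(dw), full_matrices=False)
U, V = u / np.sqrt(dt), vt.T / np.sqrt(dw)

rho = np.exp(-(omega - 2.0) ** 2) + 0.5 * np.exp(-(omega + 1.0) ** 2)
rho /= rho.sum() * dw

G_tau = -(K @ rho) * dw                    # G(tau) = -int dw K(tau, w) rho(w)
G_l = (U.T @ G_tau) * dt                   # project G(tau) onto U_l
rho_l = (V.T @ rho) * dw

assert np.allclose(G_l, -s * rho_l)        # Eq. (82) holds to machine precision
rho_norm = np.sqrt(np.sum(rho ** 2) * dw)
print(np.max(np.abs(G_l) - s * rho_norm))  # never positive: |G_l| <= S_l ||rho||
```

Because the $V_l$ are orthonormal, $|\rho_l| \le \lVert\rho\rVert_2$ by the Cauchy-Schwarz inequality, so the coefficients indeed decay at least as fast as the singular values.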
The upper panel of Fig. 28 shows the number of basis functions required for representing a typical model of fermionic $G(\tau)$ with a certain precision. The data obtained for two choices of $\omega_{\max}$ are plotted against β. The dimension of the basis increases only logarithmically with Λ; surprisingly, for bosons, it even saturates (see Fig. 4 in Ref. 84). These behaviors are in contrast to the power-law increase $\propto \beta^{1/2}$ observed for the Legendre basis 84,111) and the Chebyshev polynomial basis. 112) These results indicate that an essentially constant number of IR basis functions suffices for many-body calculations based on the imaginary-time Green's function at low T.
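The slow growth of the basis size can be checked with a rough mesh-SVD estimate. The cutoff $\epsilon = 10^{-8}$ and the mesh sizes below are assumptions of this sketch.

```python
import numpy as np

def ir_size(lam, eps=1e-8, n=1501):
    """Rough estimate of the fermionic IR basis size at cutoff eps (mesh SVD)."""
    beta, wmax = 1.0, lam                  # Lambda = beta * wmax
    tau = np.linspace(0.0, beta, n)
    omega = np.linspace(-wmax, wmax, n)
    dt, dw = beta / (n - 1), 2 * wmax / (n - 1)
    K = np.exp(-tau[:, None] * omega[None, :]
               - np.logaddexp(0.0, -beta * omega[None, :]))
    s = np.linalg.svd(np.sqrt(dt) * K * np.sqrt(dw), compute_uv=False)
    return int(np.sum(s / s[0] > eps))

sizes = [ir_size(lam) for lam in (10.0, 100.0, 1000.0)]
print(sizes)   # grows far more slowly than the beta**0.5 law of polynomial bases
```

Going from $\Lambda = 10$ to $\Lambda = 1000$ (two orders of magnitude) increases the basis size by only a modest factor, consistent with logarithmic growth.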
In practical calculations, we set $\omega_{\max}$ to a value much larger than the spectral width of the system to ensure that the spectrum is bounded in $[-\omega_{\max}, \omega_{\max}]$. The speed of convergence of $G_l$ depends on the choice of $\omega_{\max}$, but only weakly (logarithmically). 84) This is demonstrated in the lower panel of Fig. 28: the compactness of the data is essentially unaffected by the choice of Λ ($\omega_{\max}$) as long as the spectrum is bounded in $[-\omega_{\max}, \omega_{\max}]$.

Application of IR basis functions
Efficient quantum Monte Carlo sampling
Continuous-time QMC methods are widely used in the field of condensed matter physics. As demonstrated by the present authors in Ref. 113, the single-particle Green's function can be accumulated directly in terms of the IR basis. The authors considered the particle-hole symmetric single-site Anderson impurity model defined by the Hamiltonian
$$H = U n_\uparrow n_\downarrow - \mu \sum_\sigma n_\sigma + \sum_{k\sigma} \epsilon_k\, a_{k\sigma}^\dagger a_{k\sigma} + \sum_{k\sigma} V_k \left( c_\sigma^\dagger a_{k\sigma} + a_{k\sigma}^\dagger c_\sigma \right),$$
where $\mu = U/2$ and σ is the spin index. $c_\sigma$ and $c_\sigma^\dagger$ are annihilation and creation operators at the impurity site, respectively, and $a_{k\sigma}$ and $a_{k\sigma}^\dagger$ are those at the bath sites ($k$ is the internal degree of freedom of the bath). The distribution of $\epsilon_k$ is a semi-circular density of states of width 4. Figure 29 shows the coefficients of the single-particle Green's function computed for $U = 4$ and $\beta = 100$. Note that the data were accumulated directly in the IR basis (and in the Legendre basis as a reference). The model was solved using the hybridization-expansion continuous-time Monte Carlo technique. 114) In the upper panel of Fig. 29, one can clearly see that the IR yields coefficients that decay even faster than those for the Legendre basis. One can also see that the most compact representation is obtained when $\omega_{\max} = \Lambda/\beta$ matches the actual width of the spectrum: the optimal value $\Lambda \simeq 1000$ obtained for $\beta = 100$ is consistent with the largest dimensionless energy scale of the system, $\beta U = \beta W = 400$. As Λ exceeds the optimal value, the efficiency decreases only slowly. The direct measurement of Green's function will reduce memory consumption and computational time.
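The idea of accumulating the coefficients $G_l$ directly can be sketched with synthetic data. The example below is a stand-in, not the CT-QMC estimator itself: uniform random τ points and a known reference $G(\tau)$ replace the actual hybridization-expansion measurement, and the Monte Carlo average $G_l \approx (\beta/N)\sum_i G(\tau_i) U_l(\tau_i)$ is compared with quadrature.

```python
import numpy as np

# Toy Monte Carlo accumulation of IR coefficients G_l (not the CT-QMC
# estimator): sample tau uniformly and average G(tau) * U_l(tau).
beta, wmax = 10.0, 10.0
n = 801
tau = np.linspace(0.0, beta, n)
omega = np.linspace(-wmax, wmax, n)
dt, dw = beta / (n - 1), 2 * wmax / (n - 1)
K = np.exp(-tau[:, None] * omega[None, :] - np.logaddexp(0.0, -beta * omega[None, :]))
u, s, vt = np.linalg.svd(np.sqrt(dt) * K * np.sqrt(dw), full_matrices=False)
U = u / np.sqrt(dt)

rho = np.exp(-(omega - 2.0) ** 2) + 0.5 * np.exp(-(omega + 1.0) ** 2)
rho /= rho.sum() * dw
G_tau = -(K @ rho) * dw                       # reference G(tau) on the mesh

rng = np.random.default_rng(42)
samples = rng.uniform(0.0, beta, size=400_000)
G_s = np.interp(samples, tau, G_tau)          # G at the sampled times
G_l_mc = np.array([beta * np.mean(G_s * np.interp(samples, tau, U[:, l]))
                   for l in range(6)])
G_l_exact = (U.T @ G_tau)[:6] * dt
print(np.max(np.abs(G_l_mc - G_l_exact)))     # small statistical error
```

In an actual CT-QMC code, the same projection is performed on the sampled configurations, so only the handful of numbers $G_l$ needs to be stored instead of a fine τ mesh.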
The lower panel of Fig. 29 shows $G(\tau)$ reconstructed from the coefficients for $l \le 6$. The data obtained for the IR ($\Lambda = 500$) show perfect agreement with the numerically exact data, whereas the truncation in the Legendre representation results in large Gibbs oscillations.

Noise filtering of finite-size effects
The projection of a single-particle object onto the IR basis is useful not only for QMC data but also for data without statistical errors. This was demonstrated by Nagai and one of the present authors in Ref. 115 in the context of dynamical mean-field calculations using an exact-diagonalization impurity solver (DMFT+ED) at finite temperature.
In the finite-T DMFT+ED method, the self-energy $\Sigma(\omega)$, which is assumed to be local in space, is determined self-consistently by the procedure illustrated in Fig. 30. The effective impurity model is solved using the exact-diagonalization method after the bath has been discretized. Although one can compute the real-frequency self-energy in the finite-T DMFT+ED method, its imaginary part, $\mathrm{Im}\,\Sigma(\omega + i\eta)$, is usually spiky because the bath is approximated by a finite number of bath sites.
A fundamental question is how much information is included in the self-energy. The imaginary part of the real-frequency self-energy can be regarded as the spectral function of the self-energy. Thus, the IR expansion of the self-energy reads
$$\Sigma(i\omega_n) = \Sigma_{\mathrm{const}} + \sum_l \Sigma_l\, \hat U_l(i\omega_n),$$
where $\Sigma_{\mathrm{const}}$ is a frequency-independent term and $\hat U_l(i\omega_n)$ denotes the Fourier transform of $U_l(\tau)$. Nagai and Shinaoka investigated how these expansion coefficients depend on the number of bath sites for the single-orbital Hubbard model with a semi-circular non-interacting density of states of width $D$. Figure 30(a) shows $\Sigma_l$ computed for a paramagnetic metallic solution at $U = 2D$ and $\beta = 20$. As can be clearly seen, only the first few coefficients $\Sigma_l$ converge with respect to the number of bath sites; larger-$l$ components are affected by finite-bath-size effects. This clearly indicates that only the first few components carry relevant information. Correspondingly, as shown in Fig. 30(b), $\Sigma_l$ for $l \ge 10$ depends on the number of bath sites.
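The l-resolved separation of converged and bath-size-dependent information can be mimicked with a toy model. All details below are assumptions of the sketch: a smooth reference spectrum stands in for the exact self-energy spectrum, and a crude few-pole discretization of the same spectrum stands in for a finite bath.

```python
import numpy as np

# Toy illustration of finite-bath effects in IR coefficients: a smooth spectrum
# versus a coarse discrete-pole approximation of it. Low-l coefficients agree
# (converged information); high-l coefficients are dominated by the poles.
beta, wmax = 10.0, 10.0
n = 801
tau = np.linspace(0.0, beta, n)
omega = np.linspace(-wmax, wmax, n)
dt, dw = beta / (n - 1), 2 * wmax / (n - 1)
K = np.exp(-tau[:, None] * omega[None, :] - np.logaddexp(0.0, -beta * omega[None, :]))
u, s, vt = np.linalg.svd(np.sqrt(dt) * K * np.sqrt(dw), full_matrices=False)
V = vt.T / np.sqrt(dw)

rho = np.exp(-(omega - 2.0) ** 2) + 0.5 * np.exp(-(omega + 1.0) ** 2)
rho /= rho.sum() * dw
rho_l_smooth = (V.T @ rho) * dw               # coefficients of the smooth spectrum

# "Finite bath": 15 poles on a coarse grid, weights sampled from the spectrum.
poles = np.linspace(-4.0, 4.0, 15)
weights = np.interp(poles, omega, rho)
weights /= weights.sum()                      # discrete spectrum, sum rule kept
V_at_poles = np.array([np.interp(poles, omega, V[:, l]) for l in range(40)])
rho_l_bath = V_at_poles @ weights             # coefficients of the pole model

rel = np.abs(rho_l_bath[:40] - rho_l_smooth[:40])
print(rel[0], np.max(rel[20:]))   # small at low l, much larger at high l
```

Truncating the expansion at the l where the two sets of coefficients stop agreeing is precisely the filtering strategy described in the text.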
Nagai and Shinaoka also computed the physically relevant smooth spectral function of the self-energy, $\Sigma(\omega)$, by truncating the expansion of the self-energy at $l = 8$, up to which the coefficients converge. 115) As clearly demonstrated in Fig. 31, the spiky components arising from the finite-size effects are removed. However, this naive truncation breaks the causality of the self-energy. As shown in Fig. 31, this can be remedied by using the SpM analytical continuation technique.

IR approach for two-particle Green's functions
The concept of the IR was first proposed in the context of the single-particle Green's function. The present authors and co-workers extended the IR to two-particle Green's functions. 113) Two-particle Green's functions and vertex functions play a critical role in theoretical frameworks for describing strongly correlated electron systems. However, numerical calculations at the two-particle level often suffer from long computation times and massive memory consumption because these objects depend on multiple Matsubara frequencies and have high-frequency long-tail structures in the Matsubara-frequency domain, which require an elaborate treatment in practical applications. 116–120) In Ref. 113, the present authors and co-workers derived a general expansion formula for two-particle Green's functions in terms of an overcomplete representation based on the IR basis functions. The expansion formula was obtained by decomposing the spectral representation of the two-particle Green's function. It was rigorously shown that the expansion coefficients decay exponentially (the upper bound is again given by the singular values $S_l$), while all high-frequency long-tail structures in the Matsubara-frequency domain are retained.
Because the expansion formula is rather complicated, we present only some numerical results here. The present authors and co-workers solved the Hubbard model using the dynamical mean-field theory combined with the continuous-time hybridization-expansion QMC method. 114) We measured the three-point Green's function in the particle-hole channel,
$$G^{\mathrm{ph}}_{\uparrow\uparrow}(i\nu_n, i\omega_{n'}) = \int_0^\beta d\tau_{12}\, d\tau_{23}\, e^{i\nu_n \tau_{12} + i\omega_{n'} \tau_{23}}\, G^{\mathrm{ph}}_{\uparrow\uparrow}(\tau_1, \tau_2, \tau_3),$$
where
$$G^{\mathrm{ph}}_{\uparrow\uparrow}(\tau_1, \tau_2, \tau_3) = \langle T_\tau\, c_\uparrow(\tau_1)\, c_\uparrow^\dagger(\tau_2)\, c_\uparrow(\tau_3)\, c_\uparrow^\dagger(\tau_3) \rangle \quad (88)$$
with $\tau_{12} = \tau_1 - \tau_2$ and $\tau_{23} = \tau_2 - \tau_3$. Here, $\nu_n$ and $\omega_{n'}$ are fermionic and bosonic Matsubara frequencies, respectively. We measured $G^{\mathrm{ph}}_{\uparrow\uparrow}(i\nu_n, i\omega_{n'})$ in the rectangular Matsubara-frequency domain $-100 \le n \le 99$ and $-100 \le n' \le 100$. Then, the data were transformed into the IR basis. Because the IR for the two-particle Green's function is overcomplete, $G^{\mathrm{ph}}_{\uparrow\uparrow}(i\nu_n, i\omega_{n'})$ is expanded in terms of three sets of expansion coefficients, $g^{(i)}_{l_1 l_2}$ ($i = 1, 2, 3$). Figure 32 shows one of the coefficients, $g^{(1)}_{l_1 l_2}$ (the other coefficients show similar behavior). As can be clearly seen, $g^{(1)}_{l_1 l_2}$ decays (super-)exponentially in both directions. This representation thus enables an efficient treatment of two-particle quantities and opens a route for applying modern many-body theories to realistic strongly correlated electron systems.

Concluding Remarks
We reviewed the theoretical background and applications of sparse modeling. Sparsity has been recognized as a useful criterion for selecting a reasonable solution of an ill-conditioned inverse problem. Compressed sensing fully utilizes the sparsity of the expected solution, reducing measurement time. We emphasize that the choice of the basis (representation) is crucial for successful application: one needs to design a sparse representation based on experience. If a sparse representation is available for a problem, it is worth considering the application of sparse modeling, which can reveal information that would remain buried in noise in ordinary data analysis.
It has been demonstrated that sparsity is useful also in quantum many-body theory. The kernel $K$ in the relation $G = K\rho$ is ill-conditioned, meaning that the Matsubara Green's function $G$ carries little information on the physical spectrum ρ in the presence of noise. The sparse-modeling technique offers a way to single out the relevant information, allowing $G$ to be compressed essentially without loss of information. The resultant compact representation is called the IR. Once $G$ is represented in the IR basis, one can perform analytical continuation to obtain ρ or carry out many-body calculations within the IR basis. Methods based on Matsubara Green's functions can be reformulated using the IR basis, allowing efficient calculations of high-dimensional quantities in systems with multiple degrees of freedom.

where $e_i = 1$ and $\nu = \dfrac{c - \langle V_1 \rangle}{\langle V_2 \rangle}$. (B·6) $P_+$ denotes the projection onto the non-negative quadrant, i.e., $(P_+ z)_j = \max(z_j, 0)$ for each element. $S_\lambda(x)$ is the element-wise soft-thresholding function defined in Eq. (23). Starting from the initial condition $x_0 = z_0 = u_0 = 0$ and $z'_0 = u'_0 = 0$, the updates in Eqs. (B·5) are repeated until convergence is reached. The inversion of the matrix in Eq. (B·5a) is performed only once before the iteration; the updates then involve only matrix-vector products, and thus the computational cost is quite low.