Learning performance in inverse Ising problems with sparse teacher couplings

We investigate the learning performance of the pseudolikelihood maximization method for inverse Ising problems. In the teacher-student scenario under the assumption that the teacher's couplings are sparse and the student does not know the graphical structure, the learning curve and order parameters are assessed in the typical case using the replica and cavity methods from statistical mechanics. Our formulation is also applicable to a certain class of cost functions having locality; the standard likelihood does not belong to that class. The derived analytical formulas indicate that the perfect inference of the presence/absence of the teacher's couplings is possible in the thermodynamic limit taking the number of spins $N$ as infinity while keeping the dataset size $M$ proportional to $N$, as long as $\alpha=M/N>2$. Meanwhile, the formulas also show that the estimated coupling values corresponding to the truly existing ones in the teacher tend to be overestimated in the absolute value, manifesting the presence of estimation bias. These results are considered to be exact in the thermodynamic limit on locally tree-like networks, such as the regular random or Erd\H{o}s--R\'enyi graphs. Numerical simulation results fully support the theoretical predictions. Additional biases in the estimators on loopy graphs are also discussed.


Introduction
Inference based on the classical Ising model is called the inverse Ising problem or Boltzmann machine learning, which is attracting more and more attention with the increasing interest in machine learning technologies. One recent application spurring this trend is for retinal neurons [1,2], and subsequent applications to a wide range of systems have been conducted [3,4,5,6,7,8,9,10], showing the potential usefulness of the inverse Ising framework.
However, it is difficult to compute standard estimators such as the maximum likelihood estimator in this framework when the system size is large. Thus, certain approximations and/or algorithms must be tailored to ease this difficulty and meet the demands of advanced applications, which have been attempted in previous studies [11,12,13,14,15,16,17,18,19,20,21,22,23]. One of the most effective examples is the pseudolikelihood method [11,20]. This method approximates the likelihood function from the product of conditional likelihood functions, each of which is for a single random variable conditioned by its neighboring variables. This is useful for a wide range of problems defined on graphical models, and enables us to treat large systems

Formulation
In this section, we briefly review the inverse Ising framework and two associated inference methods: the maximum likelihood (ML) and pseudolikelihood (PL) methods. The class of local learning is also introduced and explained. In addition, the general framework of the statisticalmechanical analysis proposed in [24] is presented. We emphasize its technically important points and review the case of the fully-connected Ising model [24] to demonstrate its application.

Inference framework
Let us consider an Ising model consisting of N spin variables s = (s i = ±1) N i=1 and obeying the following distribution: where J ∈ R N ×N and H ∈ R N are termed couplings and external fields, respectively. The setting of the inverse Ising problem is aimed at inferring the couplings and external fields from a given dataset of spin snapshots ( ( Hereafter, the symbol· is used to represent an estimator. This canonical estimator has some desirable properties but is not always appropriate for the inverse Ising framework for reasons such as insufficient sample size or high computational cost. The PL method [11] overcomes this problem by replacing the likelihood with the conditional distribution P s i s \i , J i , H i for each s i , where J i = (J ij ) j is the coupling vector connected to the ith spin, and s \i denotes the spin vector without the ith component. The explicit form is The PL estimator is obtained separately for each i by where PL (x) = −x + log 2 cosh x.
Two remarkable properties are held by the PL estimator. The first is its consistency. When the dataset size M is sufficiently large, the PL estimator converges to the true values {Ĵ PL i ,Ĥ PL i } → {J i , H i }. The second is its locality. Owing to the factorized nature of PL, each coupling vector J i can be assessed independently to overcome the sample size insufficiency and computational cost issues; however, coupling symmetry J ij = J ji is inevitably lost (Ĵ PL ij =Ĵ PL ji in general). This local property of PL also simplifies the theoretical treatment, which inspired explorations of the "optimal cost function" in a class of models with the same locality as the PL case [24,26]. On the basis of these studies, we treat this local learning class and assume that the cost function argument is the only product of the local spin and effective field. The cost function and corresponding estimator are denoted as = (s i h(s \i )) and Ĵ i ,Ĥ i , respectively. When specifying a certain model in the class, the appropriate superscript will be attached as eqs. (5,7).

Teacher-student scenario
Here, we consider the inverse Ising problem in the teacher-student scenario. Specifically, the dataset D M ≡ s (µ) M µ=1 is assumed to be independently identically distributed (i.i.d.) samples from a teacher Ising model with the couplings J * and external fields H * , and a student Ising model attempts to infer the teacher couplings and fields from the dataset. The inference accuracy is quantified by the residual sum of squares (RSS) between the teacher couplings and student's estimator where we defined the following three macroscopic parameters: These are later associated with the order parameters in the statistical mechanical analysis.

Statistical mechanical analysis
Here, we explain the statistical mechanical formulation developed in [24] to analyze the theoretical performance of local learning models. For simplicity of the theoretical analysis, we assume the absence of external fields H = 0 both in the teacher and student models.

General framework
The basic idea of statistical mechanical analysis is to introduce the following Hamiltonian and Boltzmann distribution induced by the cost function : where Here, we reorder the spins and focus on the zeroth spin and its coupling vector J . The external field is set to zero as declared above; Tr J denotes the integration with respect to J with an appropriate measure, the detailed definition of which is given in sec. A. In the limit β → ∞, the Boltzmann distribution converges to a pointwise measure on the estimator , J ) ; thus, the Boltzmann distribution enables the capture of the estimator's properties. We are interested in the thermodynamic limit N → ∞ while keeping α = M/N = O(1); hence, the free energy density f = −(N β) −1 log Z will show the selfaveraging property. Thus, for typical datasets, the free energy density converges to its average in this limit. The averaged free energy density is denoted as where the square brackets [·] D M denote the average over the dataset, which is the average over the teacher Ising model: The average of log Z over D M is, however, generally difficult to compute. The replica method is a prescription to overcome this difficulty and is symbolized by the following identity: Assuming that n is a positive integer, we can rewrite [Z n ] D M as Rewriting this by introducing variables h a = N −1 j=1 J a j s j a and h * = N −1 j=1 J * j s j , which are hereafter called cavity fields, we get At the last equality, we took the summation over the spins except for s 0 , yielding the joint distribution P cav (h * , {h a } n a=1 J * , {J a } n a=1 ) of the cavity fields. The normalization constant Z 0 is defined through its marginal as To proceed further with the computation, we need to specify the functional form of the cavity field distribution and take the average over it. This is possible when the teacher is a fullyconnected model, as demonstrated in [24]. We review this result below as it contains some steps essential for the sparsely-connected case.

Revisit the fully-connected case
When the teacher is a fully-connected model, we can use the central limit theorem and assume that the cavity fields are multivariate Gaussian variables with appropriate covariances and means. In [24], the authors assumed that the teacher system is in the paramagnetic phase and the replica symmetry (RS) holds in both the student and teacher systems. These assumptions imply that the following four order parameters are sufficient to describe the free energy density: where C \0 is the correlation matrix between the spins: where · · · \0 denotes the average over the teacher Ising model without the zeroth spin; the last equality is due to the paramagnetic assumption. The covariances of the cavity fields are where e N S(C \0 ,J * ,Q,q,m) ≡ Tr Deferring the detailed computations to sec. A, we immediately have the result in the limit n → 0: where we introduce a Gaussian measure Further, we take the limit β → ∞, which requires the following relation: After straightforward calculations, we get where α = M/N and Extr x denote the extremization condition with respect to x. The extremization condition yields the following equations of state (EOS): whereŷ (z, Q, χ, m) = arg max Using the solution of eq. (30), we can evaluate the macroscopic parameters (9) by These relations can be derived by a standard technique using auxiliary variables [24] and the derivation is given in sec. B. Once the values of R and ρ are obtained, the RSS is eventually computed by eq. (8). Note that the values of Q * and R * are directly determined by the problem setting, as well as the inverse correlation function C \0 −1 . To obtain these quantities, we need to separately solve the direct problem.

Details of the sparsely-connected case
This section provides the extension of the above result to the sparsely-connected case, which is the main contribution of this paper. To this end, we introduce an ansatz about the estimator's behavior as well as the functional form of the cavity field distribution. Under the ansatz, the cavity field is decomposed into a signal and a noise, and it is shown that the noise part obeys essentially the same EOS as the fully-connected case. To complete the computation under the ansatz, the tree-like structure of the coupling network of the teacher model is employed.

Ansatz for the sparse case
In contrast to the fully-connected case, the cavity field distribution in the sparse case cannot be regarded as Gaussian; the distribution of h * actually becomes the sum of a few pointwise measures, which is far from Gaussian. Hence, we need a new ansatz to handle the cavity field distribution in the sparse case.
To obtain an idea of how to resolve this, let us consider an ideal situation where we know which couplings are nonzero. Let us suppose that the zeroth spin is connected to c = O(1) neighboring spins, and introduce two sets of indices Ω = {i|J * i = 0, i ∈ {1, · · · , N − 1}} and Ω = {i|J * i = 0, i ∈ {1, · · · , N − 1}}, where Ω (Ω) is called the active (inactive) set; |Ω| = c and |Ω| = N − 1 − c. If we know Ω andΩ in advance, then the inference should be conducted only on {J i |i ∈ Ω}. Accordingly, the number of variables to be inferred is just c = O(1); hence, the dataset size M = O(N ) is sufficiently large. Thus, we can apply the asymptotic theory of statistics, which implies that an estimator in this ideal case behaves aŝ this is called an oracle estimator. The "error" from the true solution, ∆ i , is a random variable.
In the local learning class with appropriate (consistent) cost functions such as PL [27], ∆ i is considered to be zero mean with variance decreasing at the rate of O(M −1 ) = O(N −1 ). The RSS is written as E = i∈Ω ∆ 2 i = O(N −1 ), and vanishes in the thermodynamic limit. Based on these observations about the oracle estimator, we assume that the (non-oracle) estimator obtained from consistent cost functions obeys the following form: where we again assume that ∆ i is a random variable which is asymptotically zero mean with variance scaled as O(N −1 ); the correlations among {∆ i } i are also assumed to be sufficiently weak. The quantityJ i is interpreted as the mean value of the estimator and will deviate from the true value J * i owing to the extensive number of noise terms {∆ i } i . The values of {J i } i∈Ω are later computed by taking the minimization condition of the free energy as the order parameters. The applicable range of this ansatz is discussed in sec. 3.3.
Let us examine the consequence of the ansatz. The RSS can be written as Now, there are two non-negligible contributions to the RSS coming from the bias in Ω and the noise inΩ; the RSS remains finite even in the limit N → ∞ in contrast to the ideal case. The ansatz also allows us to decompose the cavity field as where h a ∆ is termed as the "noise" part. Furthermore, we can assume that h Ω and h a ∆ are asymptotically independent in the limit N → ∞. This assumption is reasonable, and a schematic to explain this is given in Fig. 1. Owing to the tree-like structure, we can define the generation g of a spin s from Ω as the shortest path length between s and any spin in Ω along the network. As g grows, the correlation with {s i |i ∈ Ω} decays exponentially fast, while the number of spins belonging to generation g exponentially increases. If the correlation decay is sufficiently faster than the increase of the spins, then the majority of spins in the network can be regarded as uncorrelated with Ω. Some terms in h a ∆ are certainly correlated with h * , but their contribution ) and the number of correlating terms is O(1) owing to the sufficiently fast decay of correlations. Hence, the contribution of correlating terms vanishes and the uncorrelated majority with Ω completely dominates h a ∆ in the thermodynamic limit. These Ω Ω Figure 1: Schematic explaining the independence between (s 0 , s Ω ) and h a ∆ . The majority in the sum in h a ∆ is uncorrelated with spins in Ω and dominates h a ∆ in the thermodynamic limit.
observations indicate that eq. (17) can be now decomposed as follows: In the second line, we performed the variable transformation ∆ a = J a −J and neglected the contribution j∈Ω ∆ a j s j in h a ∆ as eq. (37). In the third line, we denoted s Ω = {s i |i ∈ Ω} and sΩ = {s i |i ∈Ω}, and performed the sum over sΩ, yielding the joint distribution In the fourth line, we used the asymptotic uncorrelatedness between h a ∆ and (s 0 , s Ω ) discussed so far. Now, the central limit theorem can be applied to the noise part {h a ∆ } a , and they can be regarded as Gaussian variables. As the fully-connected case, two order parameters describing their covariances are introduced: Counterparts of m and Q * are unnecessary because the dependence on (s 0 , s Ω ) is separately and explicitly treated in the present formulation. Then, where Again, using the techniques in sec. A we get Employing the relation (28) and taking the β → ∞ limit, we get The extremization condition with respect to Q and χ gives Further, the mean estimates {J j } j∈Ω are also evaluated by the extremization condition. The result forJ j is given by Using the parameters Q, χ, {J i } i∈Ω computed by eqs. (47-49), we can evaluate the RSS which is expressed in the present setting as The quantity C \0 −1 will be computed in another discussion on the direct problem as the fullyconnected case. In addition, P Ising (s 0 , s Ω |J * ) will also be assessed separately in the sparse case. These points are addressed in the next subsection.

Direct problem's properties
The inverse problem essentially requires certain information from its direct problem counterpart. Necessary information to compute the quantities of interest depends on the system's properties.
In the fully-connected case, two-body quantities such as C \0 −1 and i,j C \0 ij J * i J * j are sufficient. However, in the sparse case, higher-order information is needed because the central limit theorem does not fully dominate the system. Hence, the functional form of P Ising (s 0 , s Ω ) becomes necessary, as seen in eq. (46). Techniques for computing such quantities in the sparse case largely advanced in the '90-'00s. Here, we quote a portion of the results to compute the necessary quantities. For readers interested in the detailed techniques, please refer to [28,29]. Although these techniques are applied in general situations, to obtain compact analytic forms of the quantities of interest, we rely on the assumptions that the teacher model is in the paramagnetic phase and the external fields are absent.

Marginal distribution of the teacher model
The marginal distribution P Ising (s 0 , s Ω |J * ) is computed by marginalizing the whole distribution P Ising (s|J * ) with respect to sΩ. In general, this operation requires nontrivial computations and the resultant distribution becomes dependent on parameters among the marginalized spins. However, under the present assumptions, such dependencies do not exist and the expression becomes rather simple: where Ω c denotes the union index set of 0 and Ω. This form is applied to eqs. (47,49) to obtain the order parameters.

Inverse correlation function
Next, we compute the inverse correlation function C −1 ; hereafter, we treat the whole system and discard the superscript \0 as it is not essential. The so-called Gibbs free energy G is useful for the purpose: where Hessian of G at the minimum is equal to the inverse correlation function C −1 ij = ∂ 2 G ∂m i ∂m j ; hence, we may concentrate on computing G. For sparsely-connected graphs in which loops can be neglected, the free energy is described by the so-called Bethe free energy, which consists of two contributions corresponding to factor and variable nodes, in the thermodynamic limit. The Bethe free energy of G is known to have the following form: where c i denotes the connectivity or the number of edges connecting to node i, S(m) is the entropy conditioned by the magnetization m: and e and E denote an edge and the set of edges, respectively. Besides, ∂e denotes a pair of spin indices constituting edge e while ∂i represents a set of edges directly connecting to spin s i . b e (s e ) represents a probability distribution of s e ≡ (s i ) i∈∂e and is determined so as to satisfy se b e (s e ) = 1 and se s i b e (s e ) = m i for ∀e and i ∈ ∂e. Taking into account these constraints using Lagrange's multipliers yields an expression where h i→e is an auxiliary external field, often called the cavity field, playing the role of Lagrange's multiplier to match the average values of spins with the given magnetization values. Inserting eq. (55) into the constraint of se s i b e (s e ) = m i and se s j b e (s e ) = m j for ∂e = (i, j) yields determining equations of the Lagrange multipliers as Thus, {h i→e } i,e can have a complex dependence relation in general. As a result, the computation of the G's Hessian becomes difficult, and we cannot have a compact analytic form of the inverse correlation function, although its numerical computation is still possible. Fortunately, under the paramagnet and no external field assumptions, we can assume the smallness of h and m, and linearize eq. (56) with respect to these values, yielding Inserting this into eq. (53) and expanding it with respect to m up to the second order, we get Hence, the Hessian becomes where ∂i denotes the index set of nodes connected to i. This expression can be applied even if there is no edge for (i, j) (i.e., J ij = 0).

Applicable range of the ansatz
Here we consider the applicable range of the ansatz (34). This ansatz is a strong statement since it allows us to not think of any possible biases of the estimators outside the active set Ω. When is this valid? How does it relate to the tree-like network structure?
To obtain answers to these questions, we rethink eq. (49). An important observation is that this equation is merely the zero-gradient condition of with respect to J j , (j ∈ Ω) averaged over the datasets, as shown below. Denoting the empirical average on the dataset D M by · · · D M , and using the statistical mechanical analysis explained so far, we can write the zero-gradient condition with respect to J j for j ∈ Ω as where we put y = s 0 h(s \0 ,Ĵ ) = s 0 iĴ i s i . With this expression, we replace the estimatorĴ with the average over eq. (11), take the average over the dataset, and use the replica method. The result is where y a ≡ s 0 ( i J a i s j ). The ansatz (34) and RS used in sec. 3.1, in short, say where v a , z ∼ N (0, 1). Applying this form and following the same line of computations as in sec. 3.1, we get s P Ising (s J * )e −β n a=1 (y a ) ∂ (y 1 ) This computation naturally leads to the following question: Should we compute all the zerogradient conditions not only for Ω but also forΩ? This point is important because if this is the case, then the ansatz (34) is insufficient as it only suffices those for the active set j ∈ Ω. To be consistent, the answer is considered to be yes in general; hence, we need to take into account the zero-gradient conditions for k ∈Ω. This implies that the ansatz (34) should be modified and we need to introduce mean estimatesJ k for k ∈Ω in general situations.
Fortunately, if the network is tree-like, we can show that all the zero-gradient conditions are automatically satisfied once those for ∀j ∈ Ω are met. Hence, the ansatz (34) is consistent on such networks; we show proof of this below. For technical reasons, we recover the external field H * in the remaining of this subsection. When the external field exists, the student model should also have the external field variable, and hence the replica result is slightly modified. That modification is accomplished by replacing j∈ΩJ j s j with j∈ΩJ j s j +H 0 in eqs. (46,48) and (62). Here,H 0 denotes the mean estimate of the external field variable acting on the focused spin s 0 of the student model, and is determined by the extremization condition of the free energy, yielding Under the above setup, we show the consistency of eq. (34) on tree-like networks. The first step is to write down the zero-gradient condition for k ∈Ω. The result of applying the averages and replica method is the replacement of s j with s k in eq. (61). With this expression, we perform the following transformation: where · · · denotes the average over P Ising (s J * , H * ). By following the same computations so far, the second term vanishes in the limit lim β→∞ lim n→0 lim N →∞ because the coefficient converges to the right-hand side of (64) giving zero. Meanwhile, in the same limit, the first term can be transformed as and the dependence on H * k appears only in the marginal distribution P Ising (s 0 , s Ω J * , H * ). On tree-like networks, the marginal distribution necessarily takes the following form: where h \0 i is the effective field obtained by marginalizing the descendant spins of i, and is usually termed as cavity field. An important point of eq. (67) is the absence of higher-order interactions among active set spins because of the tree-like structure. Hence, the differentiation of H * k appears only through that of the effective fields. Furthermore, owing to the tree-like structure, only one of the effective fields is dependent on H * k . Specifying the corresponding index as j(∈ Ω), we get Hence, the zero-gradient conditions on the inactive setΩ are satisfied once those of the active set Ω hold, proving our statement. The above proof also provides a perspective for loopy graphs. If loops exist, then higherorder interactions emerge in P Ising (s 0 , s Ω ); they generally depend on H * k in a complex manner and yield some additional terms as a result of differentiation. In such situations, additional mean estimatesJ k for k ∈Ω will be necessary to satisfy the corresponding zero-gradient conditions; however, treating all variables inΩ is clearly infeasible. Tailoring good approximations in such cases may be interesting in future work, although in sec. 4.3 we show an example in which our present theoretical treatment becomes a good approximation even for loopy graphs.

Numerical experiments
In this section, we conduct numerical experiments to check the accuracy of the theoretical computations. The actual behavior of the order parameters and related quantities depends on the details of the coupling ensembles. Hence, we treat the regular random (RR) graph and Erdős-Rényi (ER) graph as representative examples of sparse tree-like graphs. The RR graph is characterized by one connectivity parameter c, while the ER graph is characterized by the connection probability p. To keep the generated graph sparse enough in the ER case, we assume the probability is scaled as p = d/N , yielding the mean degree d. Furthermore, we also assume that the couplings of the teacher model have the same probability of taking both signs and the strength is constant: |J * i | = K > 0. The coupling strength K is assumed to be small enough to satisfy the paramagnet assumption of the teacher model. In particular, for the RR graph, the paramagnetic condition is while that of the ER one is For readers interested in the derivation, please refer to [28,29]. The cost function is fixed to that of the PL in the following, as the simplest and commonly used case.
is introduced. By the same reason, the marginal distribution can be simplified by again introducing the cavity field h * = j∈Ω J * j s j as where Z 0 = dh * P cav (h * |J * )2 cosh h * . If the focused spin's connectivity is c, then the cavity field distribution becomes Applying the reduction (72) in eqs. (47,49) with replacement j∈ΩJ j s j →bh * in eq. (48) reduces the computation of mean estimates to that of the bias factorb. The theoretically evaluatedb was compared with that obtained by numerical experiments to check the validity of our theoretical treatment.
The numerical computation of the order parameter Q will be conducted below, but it has some delicate points. In our actual computations, the following procedure was adopted: From the generated teacher model we first compute the inverse correlation function C \0 −1 by the cavity formula (59) and numerically invert it to obtain C \0 ; then we introduce from the learning resultĴ and the mean estimateJ which is obtained asJ =bJ * into which the theoretically evaluated value ofb is inserted; finally we get a numerical value of Q by Although it is also possible to evaluate C \0 by the Monte Carlo (MC) sampling instead of using the formula (59), this method is better for controlling fluctuations and reducing computational cost.
The actual experimental procedures are summarized as follows. We first generated a random graph and the teacher couplings on it, and obtained spin snapshots using MC sampling. Then, we randomly chose a center spin s 0 from the whole spins and learned the couplings connected to s 0 by minimizing the PL cost function defined with a dataset obtained from the sampled spin configurations. This single sequence of operations provided single values of the quantities of interest, such as E and Q. To obtain the error bars of those quantities, we repeated this sequence many times. Here, the experiment had three different sources of fluctuations: the generated teacher model (graph shape and couplings), the choice of the center spin, and the MC sampling. We did not discriminate between these three fluctuations unless explicitly mentioned, and we defined the error bar as the standard error among the obtained values according to their recurrence; the number of datasets obtained this way is hereafter denoted as N set . In the MC sampling, we started from a random initial configuration and updated the state by the standard Metropolis method; one MC step (MCS) is defined by N trial flips of spins, where N is the total number of spins. We discarded the first 10 5 MCSs as burn-in to avoid systematic errors from the initialization. Furthermore, to avoid possible correlations in samples, each dataset for learning was generated by subsampling from a much larger dataset, which consists of all the configurations recorded after every few numbers of MCS. The size of the subsampled dataset was chosen to be at least five times smaller than that of the larger dataset. The optimization algorithm is a standard trust-region method using the second-order expansion of the objective function.

RR graph case
In the case of the RR graph with connectivity c, using eq. (59) the trace of the inverse correlation function becomes Substituting this in conjunction with the parameters obtained by eqs. (47-49) into eq. (50), we obtain the RSS. Below, we compare these theoretical values with the numerically evaluated ones. We start by comparing the theoretical and numerical values of E, Q, andb. In Fig. 6, these quantities are plotted against α for K = 0.2 and 0.4 at N = 200 and c = 3. In all the plots, the  The agreement between them is fairly good. The left and middle panels are plotted in the double log scale because E and Q drastically diverge in the limit α → 2. The error bars obtained from N set = 100 datasets are shown, although they tend to be comparable with the size of markers. agreement between the theoretical (dotted lines) and numerical (color markers) results is fairly good, supporting the validity of our analytical treatment.
Next, we consider the distributions of the estimators in Fig. 3, which were normalized as probability distribution functions. The left panel is the distribution of the estimators on the  0.4, 5, 3). The middle and right panels imply that the noise parts obey the zero-mean Gaussian distribution and have no discriminative difference between the active and inactive sets. Here, the histograms are generated from N set = 500 datasets; from each dataset, the number of obtained estimators is c = 3 for Ω while that forΩ is N − c − 1 = 196. active set Ω. We can observe that two peaks are located around the theoretical prediction ±bK. In the middle panel, the estimator distribution on the inactive setΩ is shown, yielding a Gaussian-like distribution with zero mean. Similar behavior is observed for the noise part on the active set, {∆ i =Ĵ i −J i } i∈Ω , the distribution of which is given in the right panel. Here, the mean estimates {J i } i∈Ω are computed by multiplying the theoretically evaluated biasb by the true coupling {J * i } i∈Ω . These observations are once again consistent with our theoretical analysis.
Thirdly, we check the finite size effect. In Fig. 4, against the system size N , the RSS and rescaled variance (multiplied by N ) of the noise parts∆ =Ĵ −J are plotted in the upper and lower panels, respectively. Although the finite size effect behaves in different ways depending on the parameters and quantities, we can see that the numerical results (markers) fairly matched the theoretical values (black dotted lines) as the system size is large. Here, the rescaled variance corresponds to the quantity Q Tr C \0 −1 /N in our theoretical computation, which is consistent with eq. (50). These results again confirm the validity of our computations. Finally, we have some noteworthy remarks. The results shown in Figs. 3 and 4 imply the possibility of an efficient method of debiasing and hypothesis testing. The bias factorb can be computed from our analytical result, and hence we can debias our estimator in an efficient manner. The residual after debiasing∆ is considered to obey a Gaussian distribution, as shown in Fig. 3, and is supported by our analytical computations in sec. A. Thus, we can efficiently compute the P-value according to the standard hypothesis testing method, enabling us to judge the relevance of the estimated couplings. Moreover, in the thermodynamic limit N → ∞, we can show that the perfect reconstruction of the teacher's network is possible for any α > 2. To do so, we need to evaluate the probability of getting false positives in the estimator. To control false positives, we introduce a constant threshold value K th (> 0), and consider estimated couplings with absolute values less than K th as negligible and set to zero; we independently repeat this procedure for all i = 1, · · · , N . Let us evaluate the probability of successfully screening out false positives using this method. The observations so far imply, on the inactive setΩ, that the estimator behaves asĴ where σ 2 i (= O(1)) is the rescaled variance of the estimate, with relation (1/N ) i∈Ω σ 2 i ≈ Q Tr C \0 −1 /N . Hence, the probability of successfully screening out these estimators onΩ is The second approximate equality comes from the asymptotic formula of the integral, which can be justified for large N . The last limiting behavior holds as long as σ i is bounded from above, because the exponential factor e − 1 2 N σ 2 i K 2 th decays fast enough compared with the number of products |Ω| = N − c − 1. Hence, we can completely suppress the false positives in the limit N → ∞. Meanwhile, we also desire to accurately reproduce the presence of couplings on Ω. This can be done by tuning the threshold value K th as smaller than the true coupling strength K (the mean estimatesJ are larger than K in the absolute value). In practical situations, we do not know the true coupling strength K in advance, and thus it is nontrivial to correctly tune K th . In such cases, it may be better to tune K th by monitoring the distribution of estimators such as Fig. 3, and to find a value that effectively separates the modes of distribution. The present theoretical analysis supports this process by manifesting the behavior of estimators in the limit N → ∞.

ER graph case
For the ER graph with connection probability p = d/N , the evaluation of the order parameters and related quantities is slightly more complex than the RR case because of the distributed nature of the connectivity. In the thermodynamic limit, the distribution of connectivity c in the ER graph obeys the Poisson distribution: The trace of the inverse correlation function fortunately becomes simple in the limit: When focusing on spin i with connectivity c i in the ER graph, its associated order parameters are computed by eq. (47) with P (h * |c i ) defined in eq. (73), and the RSS is given by This explicit dependence of the order parameter on c i is the complex point of the ER case. Upon realizing this property, we can easily compute the mean RSS for the whole network, which is written as These provide explicit formulas of the RSSs for the ER case.
As an interesting departure from the RR case, we here examine the connectivity dependence of our quantities of interest. The plots of E(c), Q(c), andb(c) at (N, α, d, K) = (400, 10, 4, 0.4) are given in Fig. 5. In this experiment, we generated ten different ER networks, performed two independent MC samplings, and conducted learning for all i = 1, · · · , N . The error bars were placed using the obtained datasets, and thus N set varied among different connectivity c.
The agreement between the theoretical and numerical results is fairly good, supporting our theoretical result. Although a slight deviation at large c in E(c) and Q(c) was observed, this was attributed to the finite size effect, which increased at large c because of insufficient system size for generating nodes with large c.
We also computed the mean RSS (80) for the whole network. The theoretical value is E mean = 0.4780, while the present experimental value is E mean = 0.4907 ± 0.0041. The slight difference between these is again attributed to the finite size effect. Here, the theoretical value was obtained by taking the sum of eq. (80) up to c = 20; the effect of this truncation was found to be small.

Square lattice case for comparison
The cavity method in direct problems is known to yield good approximations even for loopy graphs, when correlations among spins are weak; it is sometimes referred to as Bethe approximation. Here, we examined this approximation nature of the present theoretical computation of inverse problems. To this end, we compared our theoretical result for c = 4 with the simulation result on the square lattice with periodic boundary condition. To avoid possible complexity due to frustration, the present teacher couplings were assumed to be all positive and constant, J * i = K > 0, (i ∈ Ω). In Fig. 6, we plotted E andb against α for K = 0.2 on the square lattice of size 20 × 20, in comparison with our theoretical result (dotted line) computed with the assumption of the tree-like network structure. The agreement between the theoretical and numerical results is fairly good. This indicates that our theoretical result can be a good approximation even for loopy graphs.
Another interesting phenomenon for loopy graphs is the possible presence of bias in the estimated couplings for spins inΩ, as discussed in sec. 3.3. To examine this, in the upper panels of Fig. 7 we show the distributions of the coupling estimates corresponding to the next nearest neighbors (NNN) from the center spin s 0 in the teacher model for the square lattice (left) and for the RR graph with c = 4 (right). To make a fair comparison, the present teacher couplings for the RR graph case are all positive and constant as the square lattice case. These two distributions are very similar, implying that the bias in coupling estimates for remote spins is, even if it exists in loopy graphs, very weak for the present situation. For further quantitative information, the means of those distributions were plotted against the system size in the lower panels. Again, we observed no clear deviation from zero and no significant difference between the two cases of the square lattice and RR graph. These suggest the practicality of the theoretical results for wider situations than tree-like networks.

Summary and discussion
We proposed a theory to evaluate the reconstruction performance in inverse Ising problems with sparse couplings by maximizing of the pseudolikelihood in the thermodynamic limit. A large part of the theory relies on the statistical mechanical formulation in [24], but we refined the theoretical treatment in the cavity method to handle the teacher model with sparse couplings. The resultant expression requires a full functional form of the cavity field distribution, which is far from Gaussian but was obtained by appropriate consideration of the direct problem counterpart. The theoretical result shows fairly good agreement with numerical experiments conducted on the RR and ER graphs, justifying our theoretical treatment. This agreement holds even for the case of the square lattice, suggesting the practicality of the present result as an approximation for loopy graphs. The crucial assumptions of our treatment are the asymptotic behavior of the estimator (34) and the paramagnet assumption of the teacher model, leading to the decoupled distributions of the cavity fields. The former assumption implies that the teacher's couplings can be reconstructed by the student almost perfectly, as discussed at the end of sec. 4.1 according to the hypothesis testing framework, providing a theoretical reasoning to use the inverse Ising framework. This partly explains the experimental result showing excellent performance in reconstructing a neuron coupling network [10]. The latter assumption requires the smallness of the coupling strength, implying that strongly-correlated datasets cannot be treated by the proposed theory. It will be a challenge to overcome this applicability limitation.
For handling real-world datasets, finite magnetizations as well as possible loop structures in the network should be taken into account. For such realistic situations, the computation of (1/N ) Tr C −1 and P cav (h * |J * ) will be more complicated, and the ansatz (34) should be also modified for the case of loopy graphs, as discussed in sec. 3.3. The presented result can be still practical as an approximation for treating such situations, as demonstrated in sec. 4.3. Certain data analysis utilizing these theoretical results will be interesting and useful.
A clear drawback of the estimator treated in this paper is that it is not informative in the region α ≤ 2, as indicated by the divergent RSS in the limit α → 2 + 0 shown in sec. 4 and [24]. To overcome this, the use of regularizations will be promising. The 1 regularization will be particularly useful to control false positives in the estimated couplings. It is also possible to employ hypothesis testing in conjunction with 2 regularization. An extension of the present analysis to these cases is interesting and will be our focus in future work.
The inverse Ising problem or Boltzmann machine has been treated in this paper. Although this model is much simpler than the current models of machine learning communities, we believe that it is important to enhance theoretical knowledge on such simple models to maintain the reliability and interpretability of the results given by machine learning technologies. We hope that our study contributes to this direction, which will lead to a better understanding and relationship with more complex models. and performing the integration with respect to v * , we get L(Q * , Q, q, m) = Dz e where we use the relation Z 0 = 2e 1 2 Q * , which was canceled with a factor appearing by the integration of v * . Eq. (26) is easily derived from this.
For computing the entropic term S(C \0 , J * , Q, q, m), we use the rescaled variable W = √ N J and set the integration measure as Tr J = dW . Further, we represent the delta functions by the Fourier expressions as follows: where the integration contour is the imaginary axis and C 1 , C 2 are appropriate normalization constants; however, these points are irrelevant and ignored hereafter. Inserting this into eq. (23), we get Note that this quadratic form with respect toW implies thatW essentially obeys Gaussian, and thus the estimatorĴ also does. This knowledge of the distribution can be used for hypothesis testing as addressed in the main text.
In the thermodynamic limit N → ∞, we can use the saddle-point (or Laplace) method to avoid the explicit integrations with respect toQ,q,m. This yields S = Extr Q,q,m S X + S J N = Extr Q,q,m 1 2 nQQ − 1 2 n(n − 1)qq − nmm − 1 2 log 1 − nq Q +q where we used the relations i λ i W * i 2 = N Q * , i log λ i = Tr log C \0 . The limit n → 0 leads to lim n→0 S n = Extr Q,q,m 1 2Q Q + 1 2q q −mm + 1 2q and the extremization condition gives Substituting these relations into eq. (90), we obtain eq. (25). If we ignore the terms related to m andm, we have eq. (44).

B Derivation of macroscopic parameters R and ρ
To derive the expressions of R and ρ, we can employ the technique of auxiliary variables. We introduce two terms h R a (W a ) W a and h ρ a (W * ) W a in eq. (87), and perform the same line of computations as in sec. A. As a result, the entropic term is modified to the following expression: Taking the differentiation with respect to h ρ and taking the limit h ρ , h R → 0, we get The last expression is obtained by using eq. (91). Similarly,