The planted k-factor problem

We consider the problem of recovering an unknown k-factor, hidden in a weighted random graph. For k = 1 this is the planted matching problem, while the k = 2 case is closely related to the planted traveling salesman problem. The inference problem is solved by exploiting the information arising from the use of two different distributions for the weights on the edges inside and outside the planted sub-graph. We argue that, in the large size limit, a phase transition can appear between a full and a partial recovery phase as function of the signal-to-noise ratio. We give a criterion for the location of the transition.


Introduction
The problem of recovering a hidden structure (the signal) in a graph on the basis of the observation of its edges and the weights on edges and vertices appears in many, diverse contexts. Community detection [1], group testing [2], certain types of error correcting codes [3], particle tracking [4] are some example of statistical inference problems formulated on graphs in which some underlying pattern has to be identified. The feasibility of the hidden structure recovery depends, of course, on the amount of 'noise' in the problem. It turns out that, in the limit of large system sizes, sharp recovery thresholds with respect to the signal-to-noise ratio can appear. These sharp thresholds separate no recovery phases (in which no information on the signal is accessible), partial recovery phases (in which an output correlated with the signal can be obtained) and full recovery phases (in which the signal is recovered with a vanishing error).
These algorithmic transitions, analogous to phase transitions in physics, can be of different orders (depending on the number of derivatives of the order parameter that exist). The order of the transition seems to have important algorithmic implications. First order transitions, in particular, have been related to the presence of a computationally hard phases [1]. In the stochastic block model on sparse graphs, for example, a phase transition between a no-recovery phase and a partial recovery phase is found [1]. In [5,6] it has been shown that a partial to full recovery transition can also appear in the same model if denser topologies are considered. A partial to full recovery first order transition appears in low-density-parity-check error correcting codes [3], where the target is to correctly recover a code-word. Somewhat uncommonly, a partial to full recovery infinite order phase transition has been recently found in the planted matching problem, in which a hidden matching has to be detected in a weighted random graph [7][8][9]. This is a new type of a phase transition found in inference problems that has been put in relation with the presence of instabilities of the belief propagation (BP) fixed points. It is thus of interest to investigate whether it appears in other problems than the planted matching.
In this paper we study a generalization of the planted matching problem-the so-called planted k-factor problem. A k-factor of a graph G is a spanning subgraph with fixed degree k, i.e., a spanning k-regular graph. In this problem, the (weighted) k-regular graph is hidden (planted) by adding new weighted edges to it. The weights of the planted and non-planted edges are random quantities with different distributions. The goal is to recover the planted k-factor knowing k and the two weight distributions.
In analogy with percolation being continuous while the appearance of a k-core is a discontinuous (first order) phase transition [10], we may have anticipated that the transition in the k-factor problem will be of a different type than in the planted matching. We will show instead that the planted k-factor problem manifests a partial-full recovery continuous phase transition akin the one in the planted matching, and we will give a criterion for the threshold between the two phases.
The planted k-factor problem is related to many interesting inference problems. For example, it shares some similarity with the problem of structure-detection in small-world graphs [11]. In the latter case, a k-regular ring lattice is hidden in a (unweighted) random graph by a rewiring procedure. Full recovery is possible depending on the parameters of the rewiring process.
The planted one-factor problem corresponds to the aforementioned planted matching problem, introduced in [4] as a model for particle tracking. In this model, a weighted perfect matching, hidden in a graph, has to be recovered. In reference [4] the problem was studied numerically for a particular case of the distribution of weights. The results suggested the existence of a phase transition between a full recovery phase and a partial recovery phase. More recently, Moharrami and coworkers [7] rigorously analyzed another special setting of the problem assuming that G is the complete graph and the weights are exponentially distributed, and proved the existence of a partial/full recovery phase transition. The results of reference [7] have been generalized to sparse graphs and general weight distributions in reference [8], via heuristic arguments based on a correspondence between the problem and branching random walk processes. A proof of the results in [8] has been recently given in [9].
The planted two-factor problem is a relaxation of the planted traveling salesman problem [12]. In this problem, a unique, hidden Hamiltonian cycle visiting (exactly once) all vertices of a graph has to be recovered. In the planted two-factor problem, instead, the planted subgraph is more generally given by a set of cycles. Solving the two-factor problem can be, however, informative on a hidden Hamiltonian cycle. In reference [12] the Hamiltonian cycle recovery problem has been studied on the complete graph. Therein, a sufficient condition for the full recovery of the planted Hamiltonian cycle has been derived. Moreover, it is proved therein that, within the full recovery phase, the solution obtained searching for a two-factor coincides (with high probability and in the large size limit) with the hidden Hamiltonian cycle.
In this paper, we generalize the available results for the planted one-factor problem and planted two-factor problem to the general planted k-factor problem with arbitrary distributions for the edge weights and including sparse graphs in the analysis. Unlike reference [11], we will assume no prior knowledge on the structure of the planted k-factor (except for the degree of its vertices). Our approach, based on the standard cavity method and the corresponding BP equations [13], allows us to obtain, at the level of rigor of theoretical physics, a criterion for the full recovery of the planted subgraph in the large size limit of the problem. The threshold criterion is derived studying the recursive distributional equations (RDE) corresponding to the cavity equations. It turns out that a 'drift' in their solutions can appear under iteration. If this is the case, the full-recovery solution is the only stable one, and full recovery is possible. The study of such drift has been tackled in analogy with the analysis in reference [8] for the k = 1 case, and with the phenomenon of front propagation for reaction-diffusion equations [14][15][16][17][18]. We give an explicit criterion for the threshold between a partial recovery phase and a full recovery phase of the planted k-factor. Our results recover, as special cases, the ones obtained in references [7][8][9] for the planted one-factor. In the limit of dense graphs, they provide a sharper characterization of the phase transition for k = 2 with respect to the analysis in [12], as discussed in more detail in section 6.1.
The rest of the paper is organized as follows. In section 2 we define the problem under study and introduce two adopted statistical estimators, namely the block maximum a posteriori (MAP) and the symbol MAP. In section 3 we present the BP equations for the solution of the problem and their corresponding probabilistic description. In section 4 we numerically study a specific case, observing that a transition between a full recovery and a partial recovery phase can appear as function of the parameters of the problem. In section 5 we give a heuristic derivation of the criterion for the location of the transition for arbitrary weight distributions for the block MAP estimator, equation (52), which is the main result of the paper. In section 6 we compare our theoretical predictions with the numerical results obtained for different variants of the problem, including the case considered in section 4 and the Hamiltonian cycle recovery problem considered in reference [12]. Finally, conclusions and perspectives for future work are given in section 7.

Planting a k-factor
Let us assume that an integer k ∈ N and two probability densities p andp on the real line are given. We will focus on an ensemble of weighted simple graphs denoted G 0 = (V 0 , E 0 , w), containing by construction a planted k-factor to be recovered. Here V 0 = {1, . . . , N} is the set of N ∈ N vertices such that kN is even, and E 0 is the set of edges (unordered pairs of distinct vertices of V 0 ). A weight w e ∈ R is associated to each edge e of the graph, so that w = {w e : e ∈ E 0 } is the set of such weights. We introduce a probability measure over the set of weighted graphs by means of the following generation steps of G 0 (see figure 1(a)).
(a) One first constructs a k-regular graph having vertex set V 0 . The graph is chosen uniformly among all possible k-regular graph with N vertices [19]. This can be achieved using fast algorithms available in the literature [20]. The obtained graph has 1 2 kN edges and edge-set F * k . A random weight w e , generated independently of all the others with densitŷ p, is associated to each edge e ∈ F * k . (b) Given a pair of vertices that are not neighbors in F * k , a link is added between them with probability cN −1 . Let E 0 be the final set of edges of the obtained graph. A weight w e , independently generated from all the others with distribution p, is assigned to each e ∈ E 0 \F * k . We shall call planted (resp. non-planted) edges those in F * k (resp. in E 0 \F * k ). The parameters of this ensemble of weighted random graphs are thus the integers N and k, the parameter c controlling the density of non-planted edges, and the two distributionsp and p for the generation of the weights of the planted and non-planted edges respectively. Given F * k , the probability to generate a graph G 0 is therefore where here and in the following I(A) denotes the indicator function of the event A. The edges in E 0 \F * k form essentially an Erdós-Rényi random graph of average degree c. The edge-set F * k , on the other hand, is a k-factor of G 0 by construction, i.e., a spanning subgraph of G 0 in which all the vertices have the same degree k. The resulting graph has average coordination c + k. Note that, for k = 1, F * 1 is a matching on G 0 , and the introduced ensemble of graphs coincides with the one studied in reference [8] for the planted matching problem.

The inference problem
Given a graph G 0 in the ensemble described above, we wonder if it si possible to infer the k-factor F * k hidden in it. We assume that the generation rules are known, alongside with k, c, p andp. All the exploitable information is contained in the posterior probability P(F k |G 0 ) that a certain k-factor F k in G 0 is the planted k-factor F * k . From Bayes theorem we obtain the following expression for the posterior: where the symbol ∝ hides a normalization constant independent of F k . To parametrize the probability measure above, it is convenient to introduce, for each edge e, the binary variable m e ∈ {0, 1}, so that m = {m e = I(e ∈ F k ) : e ∈ E 0 } ∈ {0, 1} |E 0 | , and rewrite the posterior as where ∂i denotes the set of edges incident to the vertex i. We want to compute an estimatorF k that is 'close' to the hidden k-factor F k . The estimator is associated to the set of binary variablesm that encodes the set of edges inF k . With a slight abuse of notations, in the following we identify a set of edges F k with its corresponding m. We will denote m * the set of variables associated to F * k andm the set of variables associated to an estimatorF k . In this paper, we will quantify the distance between an estimatorF k and the true planted k-factor F * k in terms of the cardinality of the symmetric difference F * k F k between F * k andF k , or equivalently the Hamming distance between the binary string m * encoding F * k and the binary stringm encodingF k . We will consider two 'maximal a posteriori' (MAP) estimators, each one minimizing a 'measure of distance' with the planted k-factor.
• A first possibility is to choose as estimator the k-factor that minimizes the probability P(F * k =F k ) over all realizations of the problem, This estimator is usually called 'block MAP' [3]. • A different estimator, called 'symbol MAP' and denoted in the followingF (s) k , is obtained requiring that the distance to be minimized is precisely the error in equation (4). In this case, for each edge e ∈ E 0 , we choosê with P e the marginal of the posterior probability (3) for the edge e. Observe, however, that this estimator is not necessarily a k-factor.
In the following, we will discuss both the estimators defined above, in the thermodynamic limit N → ∞, as a function of the parameters of the model. As in the planted matching problem [4,7,8], the possibility of identifying the planted edges will depend on the similarity between the distribution p and the distributionp, and on the parameter c, that corresponds itself to a noise level expressing the number of confusing non-planted edges introduced in the graph.

Pruning the graph
To efficiently solve the problem, and possibly reduce the size of the input, it is convenient to proceed with a preliminary 'pruning' of the graph. Before proceeding further, let us assign a 'capacity variable' κ i = k to each vertex i. The capacity of each node i will take into account the number of unidentified planted edges incident at i. Let be the support ofp and p respectively, and let be the intersection of these supports. We suppose that Γ = ∅ (the inference problem is otherwise trivial).
If an edge e bears a weight w e ∈ supp(p)\Γ, then it can be immediately identified as 'nonplanted', and removed from G 0 . This event will happen with probability 1 − μ, where On the other hand, the set of edges surely belongs to the planted configuration, These edges can be correctly classified as 'planted' with no algorithmic effort, except for the inspection of their weights. A planted edge e can be therefore identified, solely on the basis of the value of its weight, with probability 1 −μ, whereμ We can remove from the graph the identified planted edges, see figure 1(b). We must take care, however, of reducing at the same time by 1 the capacity of the endpoints of a planted edge that is removed. At the end of this process, the capacity of a generic vertex i is 0 κ i k, see figure 1(c). For large N, the capacity of the vertices after this pruning has binomial distribution Bin(k,μ). In particular, (1 −μ) k N vertices have zero capacity, meaning that all their incident planted edges have been identified. These vertices can also be removed, alongside with all their remaining (non-planted) incident edges.
In the resulting pruned graph, a vertex has K incident planted edges and Z non-planted edges, where K and Z are two random variables having distribution The distributions of the weights of the surviving edges are obtained from the original ones conditioning the weights to be in Γ, i.e., for the non-planted and planted edges, respectively. We will denote G = (V , E , w) the pruned graph, with V ⊆ V 0 and E ⊆ E 0 the new vertex and edge-sets. For large N, V has cardinalitŷ N :=μ k N, each vertex i ∈ V having capacity 1 κ i k distributed as K. The graph will have a total of 1 2 kμN surviving planted edges and 1 2 γμ k N surviving non-planted edges.

The belief propagation equations
To write down a BP algorithm for the planted k-factor problem, we start from the probability distribution over the configurations m = {m e : e ∈ E} ∈ {0, 1} |E| on the edges of the pruned graph G, where β > 0 and ω e = ω(w e ), with The quantities ω e play the role of effective weights on the edges of the graph. The introduction of β is convenient because the measure in equation (10) coincides, for β = 1, with the posterior defined in equation (3). On the other hand, for β → ∞ the measure concentrates on the configurations maximizing the posterior, i.e., on the block MAP. Equation (10) can also be associated to a graphical model. In particular, we can associate a variable vertex m e ( ) to each edge e of G. We also introduce two types of interaction vertices. A first type of interaction vertex ( ) is associated to each i ∈ V, and linked to all variable vertices m e such that e ∈ ∂i. Such vertex imposes that κ i variables m e , e ∈ ∂i, are equal to 1. A second interaction vertex expresses the contribution e −βm e ω e ( ) for each e, and it is linked to the variable vertex m e . Pictorially, The BP algorithm [13] provides a recipe for the computation of the marginals of ν on such factor graph. The idea is to approximate the marginal ν e (m) for the edge e = (i, j) as where ν i→e (m) (respectively ν i→e (m)) mimics a marginal probability in graphical models in which j (respectively i) is absent. Such factorization is exact on infinite trees. The algorithm goal is the computation of such messages, and it is conjectured to be exact in the large size limit for sparse random graphs. The messages obey the following equations (one for each directed edge of the graph), We adopt the convention a∈A f(a) = 0 and a∈A f(a) = 1 if A = ∅ for any function f. Pictorially, equation (12) can be rendered as where the arrows indicate the directions of 'propagation' of the messages. Taking advantage of the binary nature of the variables m e , it is convenient to parametrize the marginals in terms of 'cavity fields' h i→e , so that an equation for the cavity fields h i→e can be written as Once the equations have been solved for all the fields on the graph edges, the marginal probability of the variable m on the edge e = (i, j) is given by i.e. ν e (1) evaluated with β = 1 is the probability that e ∈ F * k . The BP approximation to the symbol MAP estimator in (6) is obtained computing the messages, and then the marginal, with β = 1, and then selecting the set Observe that proceeding in this way the selected edge-setF (s) k ∪ F (0) k is not an actual k-factor in general.
The block MAP estimator is obtained taking β → +∞ in the equations for the marginals: in this limit the measure in equation (10) concentrates on the configuration {m e } e that maximizes the likelihood. For β → +∞, equation (14) simplify, and we obtain where min (r) [A] outputs the rth smallest element of the set A. The block MAP estimatorF (b) k is found using the same criterion given in equation (16) upon convergence of the algorithm, so that the final estimator for

Recursive distributional equations
In this section we will study the average error on the considered ensemble by analysing the statistical properties of the solutions of the BP equations by the cavity method [21]. We introduce the following random variables.
•Ĥ is a random variable that has the law of the cavity field h i→e given that e is a randomly chosen planted edge; • H is a random variable that has the law of the cavity field h i→e given that e is a randomly chosen non-planted edge; •Ω is a random variable that has the law of the effective weight ω e given that e is a randomly chosen planted edge; • Ω is a random variable that has the law of the effective weight ω e given that e is a randomly chosen non-planted edge.
In the hypothesis that the replica symmetric hypothesis holds (i.e., typical realizations of ν have no long-range correlations), then (14) translates into RDEs involving the introduced random variables H andĤ. To write down this set of RDEs, first observe that an endpoint i of a planted edge e is incident to Z non-planted edges, plus a set K − 1 of other planted edges. The random variable K , however, is not simply distributed as K, but instead as [13] This is because the planted subgraph, havingμ k N 1 vertices, contains 1 2μ k NκP[K = κ] edges adjacent to a vertex of capacity κ, so that the probability of picking a planted edge that is adjacent to a vertex with capacity κ is proportional to κP[K = κ]. For the sake of brevity, here and in the following, given a random variable X, we will denote by X a random variable distributed as Similarly, if e is a non-planted edge and i is one of its endpoints, there will be K planted edge incident to i, and other Z − 1 non-planted edges. Within the replica-symmetric assumption of independence of the incoming cavity fields, one thus obtains from equation (14): The equalities have to be considered in distribution. In the equations above all random variables are independent, Z is Poisson distributed with mean γ, the Ω i 's have the same law as Ω, H i are independent copies of H, and, similarly,Ĥ i ofĤ. Finally, the variable K is distributed as in equation (8a). Observe that being Z Poisson distributed, Z − 1 d = Z. Note also that, in the β → +∞ limit, equation (20) simplify, givinĝ Recalling the inclusion rule (16), the average reconstruction error is given as

Hard fields
At this point, we aim at evaluating E[ ] by solving, possibly numerically, the RDEs given in equation (20). It is, however, convenient to first isolate the contribution of 'hard-fields'.
To obtain the equations obeyed by q,q, H andĤ, it is convenient, in this sense, to observe that the correction terms being finite. From these equations we easily get that that is a set of equations for q andq. Introducing the variables Z distributed as and the variable K distributed as we can reduce ourselves to equations involving 'soft fields' H andĤ only. Using the notation introduced in equation (19) we can writê In the β → +∞ limit, It is important to note that the solutionĤ = +∞ and H = −∞ is an admissible solution of the RDEs for any value of β. This solution corresponds to a full recovery phase, in which the fraction of planted edges that are not correctly recovered vanishes as the system size grows. The average reconstruction error (22) can be rewritten as This procedure of 'hard-fields' elimination on the RDEs admits also an interpretation on a single graph instance. Infinite fields on the planted edges may appear in the BP equation (14): if a vertex i has coordination equal to its capacity κ i , then h i→e = +∞ for all its incident edges e. This is not surprising: in this case, indeed, all edges incident to i surely belong to the planted k-factor. The vertex i and all the κ i edges incident to it can be removed and the capacity of its neighbors updated. This removal procedure can be iterated until either F * k has been entirely recovered, or a non-trivial core survives in which every vertex i has |∂i| > κ i . The BP algorithm in equation (14) can be runned on the obtained core.
The described pruning appears as a generalization of the known pruning procedure adopted for the study of optimal matching on Erdós-Rényi graphs [22] and planted matching problems [8]. It corresponds to a process of identification of the planted edges merely based on the topology of the graph G (and therefore not related to the weights on the edges). A 'percolation transition' occurs in the parameter of the problem between a phase in which the graph is completely pruned (and the k-factor completely recovered), and a phase in which an (extensive) core survives. The threshold is obtained studying the equation that has q = 0 as attractor for ckμμ 1.
In the considered problem, this condition implies full recovery of the planted configuration by simple topological considerations, i.e., iterative pruning.

A numerical experiment: the exponential distribution case
We shall now numerically investigate the planted k-factor problem on the ensemble of graphs introduced above, using the tools described in the previous section. We will take an uniform distribution for the non-planted weights, and an exponential distribution for the planted weights. It follows that in this case Γ = [0, c], μ = 1,μ = 1 − e −cλ and γ = c(1 − e −kcλ ). In the c → +∞ limit,μ = μ = 1 and q =q = 1∀ k ∈ N. The large N limit results for E[ ] have been obtained by a numerical resolution of the RDEs for the soft fields for β = 1 and β → +∞ (see equation (28)) via a population dynamics (PD) algorithm [21]. The approach consists in introducing an iterative version of the RDEs discussed above that defines a new set of random variables (Ĥ (n) , H (n) ) n , n = 0, 1, . . . . These new random variables satisfy, for β → +∞, with initial conditionĤ (0) = H (−1) = 0. A similar set of iterative RDEs can be written for β = 1 starting from equation (27). The underlying assumption is that, if a non-trivial solution of the original RDEs exists, such solution will be reached in probability by equation (34) for n → +∞. From the algorithmic point of view, the law of the random variable H (n) is represented by an empirical distribution of a sample {h 1 , . . . , h N } of its representants, with N 1, so that Similarly, an empirical distribution is adopted to approximate the law ofĤ (n) . The RDEs (34) are used to update the population representing (Ĥ (n) , H (n) ) to a new population representing the variables (Ĥ (n+1) , H (n+1) ), each update corresponding to an 'iteration' of the algorithm. The algorithm stops when some convergence criterion (usually the convergence of the moments of the populations) is satisfied (see, e.g., reference [13] for additional details). The PD predictions have been compared with actual BP results, obtained solving the inference problem on a large number of instances (see also section 6 for further details about the BP implementation).
In the block MAP case (β → +∞) it is observed that, for a given pair k and c, there exists an interval Λ (b) := (λ − , λ + ) of values of λ, such that for λ ∈ Λ (b) the PD algorithm converges to a finite solution. The corresponding value of E[ ] is found to be non-zero and, within numerical precision, E[ ] → 0 smoothly on the boundary of the interval. In particular, we numerically observe that This implies that E[ ] approaches zero exponentially fast as λ → λ + , i.e., the transition is of infinite order. Remarkably, the very same behavior has been observed in the k = 1 and c → +∞ case [8,9], where it is shown that λ + = 4 and Finally, for c → +∞ it is observed that λ − → 0, whereas λ + approaches a finite limit. The obtained prediction for E[ ] is fully compatible with the BP results, and E[ ] → 0 for λ → λ ± with the size of the considered graphs. On the other hand, for λ / ∈ Λ (b) the PD algorithm does not converge. To be more precise, the population is subject to a 'drift' towards the full recovery solution, H → −∞ andĤ → +∞. This can be seen, for example, in figure 2(c), where some numerical results for c = k = 2 are given in the β → +∞ case forĤ. It is seen that the numerically estimated mean E[Ĥ] is population-size-dependent, and in particular diverges with the population size, whereas the variance is not. Moreover, larger populations correspond to larger values of E[Ĥ].
The numerical results suggest therefore that a nontrivial, attractive fixed point exists for λ ∈ Λ (b) only, otherwise the only attractor being the trivial fixed pointĤ = −H = +∞ corresponding to the full recovery phase. In reference [8] it is argued that, for k = 1, an infinite order phase transition takes place between a full recovery phase and a partial recovery phase, and in particular full recovery is obtained for λ ∈ R + \Λ (b) . The conjectures in reference [8] about the location of the transition and its nature have been recently rigorously proved in reference [9]. Our results strongly suggest that the same phenomenology extends to the k > 1 case. As in the k = 1 case, the accurate numerical determination of the endpoints of Λ (b) is heavily affected by finite-population-size effects using PD (and finite-size effects using BP). Indeed, the transition manifests itself as a front propagation in the cumulative distribution function that drifts towards large values of the fields. Such front propagations are generically driven by the behavior in the exponentially small tail far away from the front [16]. The finite population size induces a cutoff on the smallest representable value of the cumulative distribution function, that translates, assuming an exponential decay of the cumulative, into logarithmic finite population size effects on the velocity of the front and the location of the transition.
The very same phenomenology is observed for β = 1, i.e., for the symbol MAP, where a partial recovery phase λ ∈ Λ (s) = (λ − , λ + ) is surrounded by a full recovery phase. For a given pair k and c of parameters, the symbol MAP transition points are found to be very close to the block MAP transition points obtained for the same values of k and c. Also in this case,  (17) and (14) with β = 1 on graphs of N = 10 3 vertices. The c → +∞ curves are estimated using PD with c = 20 and are compared with BP results obtained solving the problem on complete graphs with N = 10 2 vertices. In the block MAP case, we marked with an arrow the partial to full recovery threshold in the c → +∞ limit predicted to be at λ = 4k. In the insets, plot of λ + − λ ln E[ ] given by the PD algorithm, with λ + estimated using the condition in equation (54). Both in the block MAP case and in the symbol MAP case, the limiting value for λ → λ + is finite, suggesting that the transition is of infinite order. (c) Mean and variance of the variableĤ estimated using a PD algorithm for k = c = 2 for β → +∞. In this case, we numerically find λ + = 7.9(1), whereas the prediction given by equation (54) is λ + = 7.9946 . . . . The results are obtained assuming all the N fields equal to zero as initial condition. An 'iteration' of the algorithm corresponds to an update of all fields of the population by means of the RDEs. For λ = 5, inside the partial recovery phase, the algorithm rapidly converges to asymptotic values that do not depend on the size N of the population. For λ = 10, i.e., in the full recovery phase, the algorithm converges to values of E[Ĥ] that are N -dependent and diverge with N , whereas the variance of the distribution goes to an N -independent value (although noisy). The described phenomenology appeared for all the investigated values of c and k, and for both the block MAP and the symbol MAP. (d) Transition point λ + for the block MAP in the exponential model for c → +∞. The continuous line correspond to the prediction λ + = 4k given in section 6.2, whereas the dots are the numerical estimation of the transition point obtained using PD with c = 50 and a population of N = 10 8 fields.
it is found that (λ + − λ) 1/2 ln E[ ] → −α for λ → λ + for someα > 0, suggesting that the transition between the partial recovery phase and the full recovery phase is of infinite order also in this case. In figure 2(b) we give the PD results for E[ ] in this case, alongside with the results of the BP simulations.

A criterion for the block MAP transition
In this section we give a heuristic criterion for the transition between the partial recovery phase E[ ] > 0, and the full recovery phase E[ ] = 0 in the case of the block MAP. Our reasoning will follow and generalize the one given in reference [8] for the k = 1 case. Our approach is inspired by the physics literature on front propagation in reaction-diffusion systems and equations of the FKPP type [16][17][18]. Before applying it, however, an additional simplification of equation (28) must be performed. We will proceed in generality, assuming thatp depends on a parameter, let us call it λ. Moreover, we will assume that a special value λ exists such that for λ < λ we are in a partial-recovery phase, whereas for λ > λ we are in a full-recovery phase. We will also assume that the transition is continuous, i.e., E[ ] → 0 smoothly as λ → λ − . Observe that E[ ] → 0 means that Pr[H 1 + H 2 < Ω] → 1 and Pr[Ĥ 1 +Ĥ 2 >Ω] → 1, see equation (29). The first property implies that, approaching the transition, H 1 < Ω − H 2 almost surely, i.e., in equation (28b) the minimum picks almost surely one of the 'planted contributions'. Similarly, the second property implies that in the same limit the minimum in equation (28a) is almost surely picked in the set of 'non-planted contributions'. These observations lead us to introduce a new set of random variables (Û (n) , U (n) ) n , satisfying the iterative RDEs, with initial conditionÛ (0) = U (0) = 0, corresponding to the expected 'effective' behavior of equation (34) near the transition. The new set of auxiliary variables is informative on the behavior of the random variables (Ĥ (n) , H (n) ) n . Indeed, we can prove that Given two random variables X and Y, we say that X Y if P[X > z] P[Y > z] for all z [23]. The proof proceeds by induction. Equation (40) are satisfied for n = 0. Assuming that they are satisfied for given n, it is easily proved that they are satisfied for n + 1, being The result stated above implies that, if Pr[Û (n) > z] → 1 for n → +∞, then Pr[Ĥ (n) > z] → 1 as well in the same limit. We will obtain now a sufficient condition to have Pr[Û (n) > z] → 1 that we will give us, therefore, a (sufficient) criterion to be in the full recovery phase. We define Denoting by E X [•] the expectation with respect to the random variable X, we have that and therefore  Figure 2(c), in particular, shows that the means of the distributions are subject to a constant drift velocity that (at the leading order in n) does not depend on n. Moreover, this assumption is compatible with what has been observed in [8], and rigorously proved in [14,15], in the study of equation (38) for K ≡ 1. Then, for n → +∞, For x → −∞, F(x) → 0 by definition, and in this limit at first order in F The choice of the appropriate θ to estimate the drift velocity is, at this point, not obvious. It can be shown [14][15][16][17][18] that the relevant value θ * is the one that corresponds to the maximum velocity, i.e., θ * = arg sup θ>0 v(θ), and therefore If v = v(θ * ) > 0, then the distribution drifts towards +∞ and we are in a full recovery phase (Ĥ → +∞). We postulate that the marginal condition v(θ * ) = 0 ⇒ ln [I (θ * )I(1 − θ * )] = 0 corresponds to the transition point. Being ln(I(θ)I(1 − θ)) a convex function symmetric around θ = 1/2, one has v = 0 when I(1/2) = 1. This condition can be written as where is the Rényi divergence of order α. In an equivalent form, equation (52) can be written as The condition above generalizes the one obtained for the planted matching problem [8], that is recovered for k = 1. As a final comment, observe that, being by construction , full recovery by means of the block MAP implies full recovery by means of the symbol MAP, and the partial recovery interval obtained using the symbol MAP is contained in the partial recovery interval of the block MAP.

Examples of results
In this section, we consider some special formulations of the planted k-factor problem, and we compare our numerical results with the theoretical predictions obtained from the general criterion given in section 5.
The numerical results are obtained studying the RDEs (20) for the problem by means of a PD algorithm for β = 1 (symbol MAP) and β → +∞ (block MAP).
We also implemented a BP algorithm for the solution of the problem on actual graphs by means of the algorithm in equation (14). In particular, a random weighted graph with N vertices is generated according to the ensemble introduced in section 2. This graph is first subject to the pruning procedures described in sections 3.1 and 3.4. Given an edge e = (i, j) of the resulting graph, we associate two fields h i→e and h j→e to it, initialized to random values. The fields are updated using equation (14) if β > 0 or using equation (17) if β = +∞ for a large number of iterations. A candidate solutionF k is then selected using the criterion in equation (16). In all cases, we stopped the algorithm after 5N updates, or before if the setF k does not change for at least 50 iterations. The error is then obtained using equation (4), and the average error E[ ] is estimated considering a large number of independent instances of the problem.
The BP algorithm for the estimation of the block MAP given in equation (17) coincides with the BP algorithm for the minimum weight k-factor introduced by Bayati and coworkers [24]. They proved therein that the algorithm converges in polynomial time to the correct minimizer as long as there are no fractional solutions, i.e., solutions with non-integer values of m e . A worst-case analysis of the convergence properties of the BP algorithm with β = 1, on the other hand, is still missing. Observe that the BP algorithm has longer running time for c k when β is finite with respect to the β → +∞ case. In this case, indeed, given a node with valence κ, each step in equation (14) requires the sum of O(c κ ) contributions. Equation (17), instead, asks only for to the κth incoming field, an operation that requires O(c) steps. On complete graphs, having c = N − k − 1, this means that the algorithm running time is increased by a factor N k−1 in the finite β case with respect to the β = +∞ case.

The fully-connected case
Dense models are recovered in our setting considering the c → +∞ limit, to be taken after the N → +∞ limit. Assuming that μμ k = 0 (the problem is otherwise trivial) this implies γ → +∞ and thereforeq = q = 1 because of equation (25), meaning that in the thermodynamic limit there are (almost surely) no hard fields.
Equation (52) also implies that, if lim c→+∞ D 1/2 (p p) < +∞ no transition can take place in the fully connected limit. To get nontrivial results in this limit, it is therefore necessary to scale the weights with c, so that at the transition Suppose, for example, that p is c-independent, andp(w) ≡ c −a f (wc −a ; b) with a > 0 and b parameters, and some function f such that f 0 (b): = lim x→0 f(x; b) ∈ (0, +∞). Then the condition (55) implies that the threshold for c → +∞ is at , then the asymptotic formula (55) is not sufficient anymore and equation (54) must be considered. The threshold condition becomes The observations above are compatible with the rigorous results obtained by Bagaria and coworkers for the planted two-factor problem on complete graphs of N vertices for N → +∞ [12]. In their paper, they prove that, for k = 2, on the threshold the following limit holds under some assumptions on the distributions p andp (fulfilled, e.g., by Gaussian or exponential distributions, see reference [12] for details). This condition corresponds to equation (55), observing that on a complete graph c = N − k − 1. As we will show below, however, equation (58) can be not sufficient to recover the transition point.

The exponential model
Let us now briefly revisit the exponential case discussed in section 4. If we apply equation (54) to derive the block MAP threshold, we obtain Introducing s = ck and t = λk −1 , one parameter can be absorbed obtaining This equation is always solved by t = λ = 0. For s = ck 1.2277 . . . , two additional solutions for t, and therefore λ, appear, let us call them λ − and λ + , delimiting the partial recovery phase, see figure 3. In the c → +∞ limit, λ − 1/c → 0 and only one transition point is found This result is confirmed by the numerics, see figure 2(d). For k = 1 we recover the known result λ + = 4, rigorously proved in [7]. The criterion predicts that 4k − λ + approaches zero as e −ck for ck → +∞. Observe that in this case, considering c = N − k − 1, D 1/2 (p p) = ln N + O(N) for all values of k and λ > 0. In other words, equation (58) of [12], although verified on the transition, is not enough to recover the threshold.

Hidden Hamiltonian cycle recovery
In this section we are interested in solving a special type of planted two-factor problem, namely the hidden Hamiltonian cycle recovery (HC) problem. This is a planted two-factor problem in which the hidden two-factor is connected, i.e., it is a Hamiltonian cycle of the graph. The very same BP algorithms discussed for the planted two-factor can be applied to recover the hidden Hamiltonian cycle. This problem was studied in reference [12], with the planted weights being assumed to be normal variables,p = N (λ, 1), whereas the non-planted weights have distribution p = N (0, 1). In this case, therefore, D 1/2 (p p) = 1 4 λ 2 . Applying equation (58), a nontrivial transition in the block MAP estimator is expected at λ 2 = 4 ln N + o(ln N). Parametrizingp with λ 2 =λ 2 ln N, the transition is then atλ = 2. This is rigorously proved and numerically verified in reference [12], see figure 4. In figure 4 we also plot, for the block MAP case, the probability that the estimatorF 2 provided by the BP algorithm is actually connected, i.e., it is a single Hamiltonian cycle. Our numerics suggest that, forλ > 2, this probability goes to 1 as N → +∞.
As discussed in section 2, however, the block MAP is not the estimator that minimizes the error in equation (4). If we aim at minimizing the error , then the optimal estimator is the symbol MAP. We recall that this estimator does not provide a k-factor in general. Running our BP algorithm on graphs with hidden Hamiltonian cycles in the ensemble considered in [12], we instances of complete graphs with N = 100 with a hidden Hamiltonian cycle to be recovered. The black lines correspond to the PD prediction for the planted two-factor with c = 100, with same planted and non-planted weight distributions. The vertical line indicates the transition point between partial and full recovery predicted by the theory for the block MAP. In gray, we plot the probability that the block MAP estimatorF (b) k=2 is a Hamiltonian cycle (instead of a union or two or more cycles) for different sizes of the problem.
obtained the results in figure 4. The average error obtained using the symbol MAP is found to be smaller than the one obtained using the block MAP, as expected. On the other hand, finding the symbol MAP is computationally more expensive, as discussed above. For comparison, in figure 4 we plot also the PD results for the two-factor, finding a good agreement between the infinite-size prediction of the two-factor problem and the BP results of the HC problem also in the partial recovery phase.

Perspectives
The transition appearing in the planted k-factor problem is of the same type as found in the planted matching problem [7][8][9] and separates a partial recovery phase from a full recovery phase. Using heuristic arguments based on the literature on front propagation for reaction-diffusion equations, we have been able to obtain a simple and explicit criterion for the transition. We numerically tested the transition criterion, and we checked its consistency with the known results on the recovery thresholds of the planted two-factor problem. A rigorous proof of this transition criterion remains, however, as an open problem.
The heuristic argument is based on the fact (numerically observed) that the phase transition is continuous. It is not excluded a priori that first order transitions are possible for some nontrivial choice of the weight distributions or degree distributions of the graph, allowing the presence of multiple BP fixed points [25].
Finally, the threshold criterion obtained in the paper concerns the block MAP and only provides a bound for full recovery by means of the symbol MAP. In the numerically investigated cases, the recovery thresholds of the symbol MAP are observed to be very close to the ones of the block MAP. A formula for the exact location of the symbol MAP transition (and possibly its relation with the block MAP transition) is however still missing.